
2.1 Survival Time

In survival analysis, the term survival time refers to the time elapsed from an origin to the occurrence of an event. In many medical studies, the origin is the time of study entry, which can be the start of a medical treatment, the initiation of a randomized experiment, or the date of surgery. In epidemiological and demographic studies, the origin is often the date of birth. The event of interest is often the occurrence of death.

In medical research, the term overall survival refers to the survival time measured from study entry until the death of a patient. For instance, to measure the effect of chemotherapy or radiotherapy in locally advanced head and neck cancer, researchers may use overall survival as the primary endpoint (Michiels et al. 2009). In this study, the origin is the start of randomization.

2.2 Kaplan–Meier Estimator and Survival Function

We shall introduce the random censorship model, in which we consider two random variables:

  • \( T \): survival time

  • \( U \): censoring time

Due to censoring, only one of \( T \) and \( U \) is observed. One can observe \( T \) if death comes before censoring (\( T \le U \)). On the other hand, one cannot observe \( T \) exactly if censoring comes before death (\( U < T \)). Even if the exact value of \( T \) is unknown for a censored case, \( T \) is known to be greater than \( U \). What we observe are the time occurring first (\( \hbox{min} \{ T,U\} \)) and the censoring status (\( \{ T \le U\} \) or \( \{ U < T\} \)). The random censorship model typically assumes that \( T \) and \( U \) are independent, namely \( \Pr (T \in A,U \in B) = \Pr (T \in A)\Pr (U \in B) \) for sets \( A \) and \( B \).
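This observation scheme can be sketched in a short simulation. The following Python snippet is a minimal illustration under hypothetical exponential distributions for \( T \) and \( U \) (the rates are assumptions for the example, not from the text):

```python
import random

random.seed(1)  # for reproducibility

def simulate_survival_data(n, rate_t=0.01, rate_u=0.005):
    """Generate (t_i, delta_i) under the random censorship model with
    independent T ~ Exp(rate_t) and U ~ Exp(rate_u) (hypothetical rates)."""
    data = []
    for _ in range(n):
        t_true = random.expovariate(rate_t)     # survival time T
        u = random.expovariate(rate_u)          # censoring time U, independent of T
        data.append((min(t_true, u),            # observed time: whichever comes first
                     1 if t_true <= u else 0))  # delta = I(T <= U)
    return data

data = simulate_survival_data(1000)
# Here Pr(T <= U) = rate_t / (rate_t + rate_u) = 2/3, so about two-thirds
# of the observations should be uncensored deaths
prop_deaths = sum(delta for _, delta in data) / len(data)
```

With independent exponential times, the proportion of observed deaths concentrates around \( \Pr (T \le U) \), which illustrates how heavier censoring (larger censoring rate) reduces the number of observed events.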

Survival data consist of \( (t_{i} ,\delta_{i} ) \), \( i = 1, \ldots ,n \), where

  • \( t_{i} \): survival time or censoring time whichever comes first,

  • \( \delta_{i} \): censoring indicator (\( \delta_{i} = 1 \) if \( t_{i} \) is survival time, or \( \delta_{i} = 0 \) if \( t_{i} \) is censoring time).

Under the random censorship model, one can write \( t_{i} = \hbox{min} \{ T,U\} \) and \( \delta_{i} = {\mathbf{I}}(T \le U) \), where \( {\mathbf{I}}( \cdot ) \) is the indicator function. We shall assume that all the observed times to death are distinct (\( t_{i} \ne t_{j} \) whenever \( i \ne j \) and \( \delta_{i} = \delta_{j} = 1 \)), so that there are no ties in the death times. With the survival data, one can estimate the survival function \( S(t) \equiv \,\Pr (\;T > t\;) \) by the following estimator:

Definition 1

The Kaplan–Meier estimator (Kaplan and Meier 1958) is defined as

$$ \hat{S}(t) = \prod\limits_{{t_{i} \le t,\delta_{i} = 1}} {\left( {1 - \frac{1}{{n_{i} }}} \right)} ,\quad 0 \le t \le \mathop {\hbox{max} }\limits_{i} (t_{i} ) $$

where \( n_{i} = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}(\;t_{\ell } \ge t_{i} \;)} \) is the number at-risk at time \( t_{i} \); \( \hat{S}(\;t\;) = 1 \) if no death occurs up to time \( t \); \( \hat{S}(\;t\;) \) is undefined for \( t > \mathop {\hbox{max} }\limits_{i} (t_{i} ) \).
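As a minimal illustration of Definition 1 (using small hypothetical data, not from the text), the Kaplan–Meier estimator can be computed directly from the product formula:

```python
def kaplan_meier(times, deltas):
    """Kaplan-Meier estimate S(t), evaluated just after each observed time.
    Assumes no tied death times, as in the text."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    s, curve = 1.0, []
    for i in order:
        if deltas[i] == 1:
            # number at risk n_i = #{l : t_l >= t_i}
            n_i = sum(1 for l in range(n) if times[l] >= times[i])
            s *= 1.0 - 1.0 / n_i  # multiply in the factor (1 - 1/n_i)
        curve.append((times[i], s))
    return curve

# Hypothetical data: deaths at t = 1, 4, 6; censorings at t = 3, 8
curve = dict(kaplan_meier([1, 3, 4, 6, 8], [1, 0, 1, 1, 0]))
# S(1) = 1 - 1/5 = 0.8;  S(4) = 0.8 * (1 - 1/3) = 8/15;  S(6) = (8/15) * (1 - 1/2) = 4/15
```

Note that the estimate drops only at death times; the censoring at \( t = 3 \) leaves the curve unchanged but shrinks the risk set used at later death times.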

The derivation of the Kaplan–Meier estimator: Consider a survival function that is a decreasing step function with jumps only at the observed times of death. Then, one can write (Exercise 1 in Sect. 2.9) the survival function in the form

$$ S(\;t\;) = \Pr (\;T > t\;) = \prod\limits_{{t_{i} \le t,\;\delta_{i} = 1}} {\left( {1 - \frac{{\Pr (T = t_{i} )}}{{\Pr (T \ge t_{i} )}}} \right)} . $$

Second, suppose that \( T \) and \( U \) are independent. Then, one can write

$$ S(\;t\;) = \prod\limits_{{t_{i} \le t,\;\delta_{i} = 1}} {\left( {1 - \frac{{\Pr (T = t_{i} ,\;U \ge t_{i} )}}{{\Pr (T \ge t_{i} ,\;U \ge t_{i} )}}} \right)} = \prod\limits_{{t_{i} \le t,\;\delta_{i} = 1}} {\left( {1 - \frac{{\Pr (\hbox{min} \{ \;T,\;U\;\} = t_{i} ,\;T \le U)}}{{\Pr (\hbox{min} \{ \;T,\;U\;\} \ge t_{i} )}}} \right)} . $$

Finally, we replace the probability ratio of the last expression by its estimate to obtain

$$ \hat{S}(\;t\;) = \prod\limits_{{t_{i} \le t,\;\delta_{i} = 1}} {\left( {1 - \frac{{\sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}(t_{\ell } = t_{i} ,\;\delta_{\ell } = 1)/n} }}{{\sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}(t_{\ell } \ge t_{i} )/n} }}} \right)} = \prod\limits_{{t_{i} \le t,\;\delta_{i} = 1}} {\left( {1 - \frac{1}{{n_{i} }}} \right)} . $$

It is now clear that the Kaplan–Meier estimator relies on the independence assumption between \( T \) and \( U \).■

The Kaplan–Meier survival curve is defined as the plot of \( \hat{S}(\;t\;) \) against \( t \), starting with \( t = 0 \) and ending with \( t_{\hbox{max} } = \mathop {\hbox{max} }\limits_{i} (t_{i} ) \). The curve is a step function that jumps only at points where a death occurs. On the curve, censoring times are often indicated by the mark “+”.

If \( t_{\hbox{max} } = \mathop {\hbox{max} }\limits_{i} (t_{i} ) \) corresponds to the time to death of a patient, then \( \hat{S}(\;t_{\hbox{max} } \;) = 0 \). If \( t_{\hbox{max} } = \mathop {\hbox{max} }\limits_{i} (t_{i} ) \) corresponds to the censoring time of a patient, then \( \hat{S}(\;t_{\hbox{max} } \;) > 0 \). It is misleading to plot \( \hat{S}(\;t\;) \) only up to the largest death time \( \mathop {\hbox{max} }\limits_{{i;\;\delta_{i} = 1}} (t_{i} ) \), especially when many patients are alive beyond \( \mathop {\hbox{max} }\limits_{{i;\;\delta_{i} = 1}} (t_{i} ) \).

Survival data often include covariates, such as gender, tumor size, and cancer stage. With covariates, survival data consist of \( (t_{i} ,\delta_{i} ,{\mathbf{x}}_{i} ) \), \( i = 1, \ldots ,n \), where

  • \( {\mathbf{x}}_{i} = (x_{i1} , \ldots ,x_{ip} )^{\prime} \): \( p \)-dimensional covariates

In traditional survival analysis, the data are analyzed under the following assumption:

Independent censoring assumption: Survival time and censoring time are independent given covariates. That is, \( T \) and \( U \) are conditionally independent given \( {\mathbf{x}} \).

For a patient \( i \), one can define the survival function \( S(t|{\mathbf{x}}_{i} ) \equiv \,\Pr (\;T > t\;|\;{\mathbf{x}}_{i} \;) \) for \( t \ge 0 \). The survival function is the probability that the patient is alive at time \( t \). The survival function \( S(t|{\mathbf{x}}_{i} ) \) is, in fact, a patient-level survival function, as it is conditional on the patient characteristics \( {\mathbf{x}}_{i} \). The survival function at \( {\mathbf{x}}_{i} = {\mathbf{0}} \) is called the baseline survival function and denoted as \( S_{0} (t) = S(t|{\mathbf{x}}_{i} = {\mathbf{0}}) \).

A parametric model is given by a survival function that is specified by a finite number of parameters. For instance, we consider an exponential survival function \( S(t|x_{i} ) = \exp (\; - \lambda te^{{\beta x_{i} }} \;) \), \( t \ge 0 \), where \( \lambda > 0 \) and \( - \infty < \beta < \infty \) are parameters. Let \( x_{i} \) denote the gender with \( x_{i} = 1 \) for male and \( x_{i} = 0 \) for female. One can show that \( S(t|x_{i} ) = S_{0} (t)^{{\exp (\beta x_{i} )}} \) for \( t \ge 0 \), where \( S_{0} (t) = S(t|x_{i} = 0) = \exp ( - \lambda t) \) is the baseline survival function. With this model, the survival difference between males and females is captured by \( \beta \). The case \( \beta > 0 \) corresponds to a poorer survival prognosis for males relative to females; the case \( \beta < 0 \) corresponds to a better survival prognosis for males relative to females; the case \( \beta = 0 \) corresponds to an equal survival prognosis between males and females.
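The identity \( S(t|x_{i} ) = S_{0} (t)^{{\exp (\beta x_{i} )}} \) can be checked numerically; the following Python sketch uses hypothetical values of \( \lambda \) and \( \beta \) (assumptions for illustration only):

```python
import math

lam, beta = 0.002, 0.7  # hypothetical parameter values (not from the text)

def S(t, x):
    """Exponential survival function S(t|x) = exp(-lam * t * e^{beta * x})."""
    return math.exp(-lam * t * math.exp(beta * x))

def S0(t):
    """Baseline survival function S0(t) = S(t|x = 0) = exp(-lam * t)."""
    return math.exp(-lam * t)

# Verify S(t|x) = S0(t)^{exp(beta * x)} on a grid of time points for x = 1
ok = all(abs(S(t, 1) - S0(t) ** math.exp(beta)) < 1e-12 for t in range(0, 1001, 50))
```

The check works because \( \exp ( - \lambda te^{\beta } ) = \{ \exp ( - \lambda t)\}^{{e^{\beta } }} \), which is exactly the proportional-hazards form introduced next for the semi-parametric model.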

A semi-parametric model is given by a survival function that is partially specified by a finite number of parameters. For instance, we consider a survival function \( S(t|x_{i} ) = S_{0} (t)^{{\exp (\beta x_{i} )}} \), where the form of the baseline survival function \( S_{0} (t) \) is unspecified. In terms of \( \beta \), one can compare survival between males and females without assuming a specific model on the baseline survival function.

2.3 Hazard Function

Hereafter, we suppose that \( S(t|{\mathbf{x}}_{i} ) \) is a continuous survival function. The instantaneous death probability between \( t \) and \( t + {\text{d}}t \) is \( \Pr (\;t \le T < t + {\text{d}}t\;|\;{\mathbf{x}}_{i} \;) = S(\;t\;|\;{\mathbf{x}}_{i} \;) - S(\;t + {\text{d}}t\;|\;{\mathbf{x}}_{i} \;) \), where \( {\text{d}}t \) is an infinitesimally small number. Since this probability vanishes as \( {\text{d}}t \) tends to zero, one can consider the rate obtained by dividing by \( {\text{d}}t \):

$$ f(\;t|\;{\mathbf{x}}_{i} \;) = \frac{{S(t\;|\;{\mathbf{x}}_{i} \;) - S(t + {\text{d}}t\;|\;{\mathbf{x}}_{i} \;)}}{{{\text{d}}t}} = \mathop {\lim }\limits_{\Delta t \to 0} \frac{{S(t\;|\;{\mathbf{x}}_{i} \;) - S(t + \Delta t\;|\;{\mathbf{x}}_{i} \;)}}{\Delta t} = - \frac{{{\text{d}}S(t\;|\;{\mathbf{x}}_{i} \;)}}{{{\text{d}}t}}. $$

This limit is the density function of the survival time.

The hazard rate describes the instantaneous death rate between \( t \) and \( t + {\text{d}}t \) given that the patient is at-risk at \( t \):

Definition 2

The hazard function (or hazard rate function) is defined as

$$ h(t|{\mathbf{x}}_{i} ) \equiv \frac{{\Pr (\;t \le T < t + {\text{d}}t\;|T \ge t,\;{\mathbf{x}}_{i} \;)}}{{{\text{d}}t}} = \frac{{ - \frac{\text{d}}{{{\text{d}}t}}\;S(\;t|\;{\mathbf{x}}_{i} \;)}}{{S(\;t|\;{\mathbf{x}}_{i} \;)}}. $$

The hazard function at \( {\mathbf{x}}_{i} = {\mathbf{0}} \) is called the baseline hazard function and denoted as \( h_{0} (t) = h(t|{\mathbf{x}}_{i} = {\mathbf{0}}) \). The cumulative hazard function is defined as \( H(t|{\mathbf{x}}_{i} ) = \int_{0}^{t} {h(u|{\mathbf{x}}_{i} ){\text{d}}u} \). The survival function is derived from the hazard function through \( S(t|{\mathbf{x}}_{i} ) = \exp \{ \; - H(t|{\mathbf{x}}_{i} )\;\} \).
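The relation \( S(t|{\mathbf{x}}_{i} ) = \exp \{ \; - H(t|{\mathbf{x}}_{i} )\;\} \) can be verified numerically. The sketch below uses a hypothetical Weibull-type hazard \( h(t) = 2t \), for which the cumulative hazard has the closed form \( H(t) = t^{2} \):

```python
import math

def h(t):
    """Hypothetical hazard h(t) = 2t (a Weibull hazard with shape 2, scale 1)."""
    return 2.0 * t

def cumulative_hazard(t, steps=10000):
    """H(t) = integral of h(u) du over [0, t], by the trapezoidal rule."""
    dt = t / steps
    return sum(0.5 * (h(i * dt) + h((i + 1) * dt)) * dt for i in range(steps))

t = 1.5
S_from_hazard = math.exp(-cumulative_hazard(t))  # S(t) = exp{-H(t)}
S_exact = math.exp(-t ** 2)                      # closed form, since H(t) = t^2
```

The numerically integrated cumulative hazard reproduces the closed-form survival function, illustrating that the hazard determines the survival function (and vice versa).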

The hazard rate is also known as the force of mortality in actuarial science and demography. For example, let \( t = \) “60 years old”, \( {\text{d}}t = \) “1 year”, and \( x_{i} = 1 \) for male or \( x_{i} = 0 \) for female. Then, the force of mortality \( h(60|x_{i} = 1) \) is approximately equal to the probability of death within the next year for a 60-year-old man. The Japanese life tables show \( h(60|x_{i} = 1) \) = 0.0064 (0.64%). The value of \( h(t|x_{i} = 1) \) increases monotonically as \( t \) grows, which represents the effect of natural aging. Eventually, it reaches \( h(100|x_{i} = 1) \) = 0.3995 (39.95%). This implies that about 40% of Japanese males who have just celebrated their 100th birthday will die before their next birthday. Life tables for almost any country are available on the internet (e.g., Google “Taiwan life table”).

Unfortunately, the hazard function for cancer patients in medical studies rarely shows any simple pattern (e.g., monotonically increasing or decreasing). In many clinical trials, the time \( t \) is measured from the start of treatment, and hence age is regarded as a covariate. In this case, the hazard of patients may be influenced by the treatment effect, the follow-up processes, and cancer progression, so the effect of natural aging may diminish. In epidemiological studies focusing on the age-specific incidence of a particular disease, the time \( t \) is measured from birth, as in the example of the Japanese life tables. However, the shape of the hazard function of disease incidence may be difficult to specify.

This implies that many simple models, such as the exponential, Weibull, and lognormal models, may not fit survival data from cancer patients. This is why semi-parametric models are more useful and widely applied in medical research. One may still accept the assumptions that the hazard function is continuous, smooth (continuously differentiable), and does not change abruptly over time. Hazard models with cubic splines (Chap. 4) meet these assumptions without overly restricting the shape of the hazard function.

The semi-parametric model \( S(t|x_{i} ) = S_{0} (t)^{{\exp (\beta x_{i} )}} \) can alternatively be specified in terms of the hazard function

$$ h(t|x_{i} ) = h_{0} (t)\exp (\beta x_{i} ) $$
(2.1)

where the form of \( h_{0} (t) \) is unspecified. One can show \( h_{0} (t) = - d\{ \;\log S_{0} (t)\;\} /{\text{d}}t \) and \( S_{0} (t) = \exp \{ \; - H_{0} (t)\;\} \), where \( H_{0} (t) = \int_{0}^{t} {h_{0} (u){\text{d}}u} \).

Let \( x_{i} \) be a dichotomous covariate, such as gender with \( x_{i} = 1 \) for male and \( x_{i} = 0 \) for female. Under the model (2.1), the relative risk (RR) is defined as

$$ RR = \exp (\beta ) = \frac{{h(t|x_{i} = 1)}}{{h(t|x_{i} = 0)}}. $$

For instance, the value \( RR = 2 \) implies that the death rate for \( x_{i} = 1 \) is twice the death rate for \( x_{i} = 0 \).

Let \( x_{i} \) be a continuous covariate, such as a gene expression. If the scale of \( x_{i} \) is standardized (to mean = 0 and SD = 1), then \( RR = \exp (\beta ) \) is interpreted with respect to a one-SD increase in \( x_{i} \). If one is interested in the effect of \( x_{i} = 2 \) relative to \( x_{i} = - 2 \), then \( RR = \exp (4\beta ) \).
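A small numerical sketch of these two interpretations, with a hypothetical coefficient \( \beta = 0.3 \) (an assumption for illustration):

```python
import math

beta = 0.3  # hypothetical regression coefficient for a standardized covariate

rr_per_sd = math.exp(beta)        # RR for a one-SD increase in x
rr_contrast = math.exp(4 * beta)  # RR of x = 2 relative to x = -2 (a four-SD contrast)
# exp(4 * beta) = exp(beta)^4: the contrast compounds the per-SD relative risk
```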

2.4 Log-Rank Test for Two-Sample Comparison

The log-rank test is a method to test the equality of the hazard rates between two groups. Specifically, we consider the null hypothesis

$$ H_{0} :\;h(t|x_{i} = 0) = h(t|x_{i} = 1),\quad \quad t \ge 0, $$

where \( x_{i} = 1 \) for male and \( x_{i} = 0 \) for female, for instance. This null hypothesis is the same as the equality \( S(t|x_{i} = 0) = S(t|x_{i} = 1) \) due to the relationship between the hazard function and the survival function. We wish to test \( H_{0} \) without making any model assumption, but under the assumption that there are no ties in the death times. The treatment of ties shall be briefly discussed in Sect. 2.8.

Let \( n_{i1} = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}\{ \;t_{\ell } \ge t_{i} ,\;x_{\ell } = 1\;\} } \) be the number of males and \( n_{i0} = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}\{ \;t_{\ell } \ge t_{i} ,\;x_{\ell } = 0\;\} } \) be the number of females at-risk at time \( t_{i} \). Hence, \( n_{i0} + n_{i1} \) is the total number at-risk at time \( t_{i} \). Each death at time \( t_{i} \) is either the death of a male (\( x_{i} = 1 \)) or the death of a female (\( x_{i} = 0 \)). If there is no effect of gender on survival, males and females have the same death rate. Hence, the conditional expectation of \( x_{i} \) given \( (\delta_{i} = 1,\;n_{i0} ,\;n_{i1} ) \) is

$$ \begin{aligned} E[x_{i} |\delta_{i} = 1,\;n_{i0} ,\;n_{i1} ] & = \Pr (x_{i} = 1|\delta_{i} = 1,\;n_{i0} ,\;n_{i1} ) \\ & = \frac{{\Pr (x_{i} = 1,\;\delta_{i} = 1|\;n_{i0} ,\;n_{i1} )}}{{\Pr (x_{i} = 1,\;\delta_{i} = 1|\;n_{i0} ,\;n_{i1} ) + \Pr (x_{i} = 0,\;\delta_{i} = 1|\;n_{i0} ,\;n_{i1} )}} \\ & = \frac{{n_{i1} h(t_{i} |x_{i} = 1)}}{{n_{i1} h(t_{i} |x_{i} = 1) + n_{i0} h(t_{i} |x_{i} = 0)}} \\ & = \frac{{n_{i1} }}{{n_{i0} + n_{i1} }}. \\ \end{aligned} $$

The last equation holds under the null hypothesis \( H_{0} \). The difference between \( x_{i} \) and its expectation leads to the log-rank statistic

$$ S = \sum\limits_{i = 1}^{n} {\delta_{i} \left( {x_{i} - \frac{{n_{i1} }}{{n_{i0} + n_{i1} }}} \right)} . $$

Hence, \( S > 0 \) is associated with a higher death rate in males than in females. Under \( H_{0} \), the mean of \( S \) is zero. If we assume that the \( x_{i} \)’s are independent,

$$ Var(S) = \sum\limits_{i = 1}^{n} {\delta_{i} \frac{{n_{i1} n_{i0} }}{{(n_{i0} + n_{i1} )^{2} }}} . $$

The log-rank test for no gender effect is based on the Z-statistic \( z = S/\sqrt {Var(S)} \) or the chi-square statistic \( z^{2} \). The P-value is computed as \( \Pr (\;|Z| > |z|\;) \), where \( Z \sim N(0,\;1) \).

Example 1

Consider a sample of five females and five males (\( n = 10 \)) with \( t_{i} = \)(1650, 30, 720, 450, 510, 1110, 210, 1380, 1800, 540), \( \delta_{i} = \)(0, 1, 0, 1, 1, 0, 1, 1, 0, 1), and \( x_{i} = \)(0, 0, 0, 0, 0, 1, 1, 1, 1, 1). To calculate the log-rank statistic, it is convenient to summarize the data into Table 2.1.

Table 2.1 Tabulation of the \( n = 10 \) samples

The log-rank statistic has the “(observed)-(expected)” form, namely

$$ S = \sum\limits_{i = 1}^{n} {\delta_{i} x_{i} } - \sum\limits_{i = 1}^{n} {\delta_{i} \frac{{n_{i1} }}{{n_{i0} + n_{i1} }}} = 3 - \left( {\frac{5}{10} + \frac{5}{9} + \frac{4}{8} + \frac{4}{7} + \frac{4}{6} + \frac{2}{3}} \right) = 3 - 3.46 = - 0.46. $$

The negative value of \( S \) implies that the observed mortality of males is lower than its expected value under \( H_{0} \). The variance is computed from Table 2.1 as

$$ Var(S) = \sum\limits_{i = 1}^{n} {\delta_{i} \frac{{n_{i1} n_{i0} }}{{(n_{i0} + n_{i1} )^{2} }}} = \frac{5 \times 5}{{10^{2} }} + \frac{5 \times 4}{{9^{2} }} + \frac{4 \times 4}{{8^{2} }} + \frac{4 \times 3}{{7^{2} }} + \frac{4 \times 2}{{6^{2} }} + \frac{2 \times 1}{{3^{2} }} = 1.436. $$

Hence, the test statistic is \( z = S/\sqrt {Var(S)} = - 0.46/\sqrt {1.436} = - 0.384 \), and the P-value is \( \Pr (\;|Z| > 0.384\;) = 0.70 \). We see no significant evidence for gender effect on survival. ■
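The calculations of Example 1 can be reproduced programmatically; a minimal Python sketch:

```python
import math

# Data of Example 1: five females (x = 0) and five males (x = 1)
t = [1650, 30, 720, 450, 510, 1110, 210, 1380, 1800, 540]
d = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
n = len(t)

S, V = 0.0, 0.0
for i in range(n):
    if d[i] == 1:  # sum over observed deaths only
        n1 = sum(1 for l in range(n) if t[l] >= t[i] and x[l] == 1)  # males at risk
        n0 = sum(1 for l in range(n) if t[l] >= t[i] and x[l] == 0)  # females at risk
        S += x[i] - n1 / (n0 + n1)     # observed minus expected
        V += n1 * n0 / (n0 + n1) ** 2  # variance contribution

z = S / math.sqrt(V)
# The example reports S = -0.46, Var(S) = 1.436, and z = -0.384
```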

The log-rank test is a non-parametric test that does not employ any distributional assumption. The log-rank test simply examines the excess mortality. Software packages for survival analysis display both “observed” and “expected” numbers of deaths in their outputs, in addition to the Z-value and P-value. The log-rank test can also handle left-truncation (Klein and Moeschberger 2003). The log-rank test has variants, such as multi-group tests, log-rank trend tests, and stratified log-rank tests (Collett 2003; Klein and Moeschberger 2003).

2.5 Cox Regression

Since the hazard function is the basis of the risk comparison between two groups, it is then natural to incorporate the effect of covariates into the hazard function.

Definition 3

The Cox proportional hazards model (Cox 1972) is defined as

$$ h(t|{\mathbf{x}}_{i} ) = h_{0} (t)\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{i} ), $$

where \( {\varvec{\upbeta}} = (\beta_{1} , \ldots ,\beta_{p} )^{\prime } \) are unknown coefficients and \( h_{0} ( \cdot ) \) is an unknown baseline hazard function.

The Cox model states that the hazard function \( h(t|{\mathbf{x}}_{i} ) \) is proportional to \( h_{0} (t) \) with the relative risk \( \exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{i} ) \). This implies that all patients share the same time-trend function \( h_{0} (t) \). The most striking feature of the Cox model is that the form of \( h_{0} ( \cdot ) \) is unspecified. Hence, the Cox model is a semi-parametric model, offering greater flexibility than parametric models that specify the form of \( h(t|{\mathbf{x}}_{i} ) \).

One can estimate \( {\varvec{\upbeta}} \) without estimating \( h_{0} ( \cdot ) \). Based on data \( (t_{i} ,\;\delta_{i} ,\;{\mathbf{x}}_{i} ) \), \( i = 1, \ldots ,n \), let \( R_{i} = \{ \;\ell :\;\,t_{\ell } \ge t_{i} \;\} \) be the risk set that contains patients at-risk at time \( t_{i} \). The partial likelihood estimator \( {\hat{\varvec{\upbeta}}} = (\hat{\beta }_{1} , \ldots ,\hat{\beta }_{p} )^{\prime} \) is defined by maximizing the partial likelihood function (Cox 1972)

$$ L({\varvec{\upbeta}}) = \prod\limits_{i = 1}^{n} {\left( {\frac{{\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{i} )}}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}} \right)^{{\delta_{i} }} } . $$

The log-partial likelihood is

$$ \ell ({\varvec{\upbeta}}) = \log L({\varvec{\upbeta}}) = \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {{\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{i} - \log \left\{ {\sum\limits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} } \right\}} \right]} . $$
(2.2)
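As a check (using the data of Example 1 with a single binary covariate), the log-partial likelihood (2.2) can be evaluated directly; a Python sketch:

```python
import math

# Data of Example 1 with a single binary covariate x (gender)
t = [1650, 30, 720, 450, 510, 1110, 210, 1380, 1800, 540]
d = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
n = len(t)

def log_partial_likelihood(beta):
    """Log-partial likelihood (2.2) with a scalar covariate."""
    ell = 0.0
    for i in range(n):
        if d[i] == 1:
            risk_set = [l for l in range(n) if t[l] >= t[i]]  # R_i
            ell += beta * x[i] - math.log(sum(math.exp(beta * x[l]) for l in risk_set))
    return ell

# The estimate reported later in the chapter is beta_hat = -0.3156; it should
# attain a higher log-partial likelihood than, e.g., beta = 0
better = log_partial_likelihood(-0.3156) > log_partial_likelihood(0.0)
```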

The derivatives of \( \ell ({\varvec{\upbeta}}) \) give the score function,

$$ {\mathbf{S}}({\varvec{\upbeta}}) = \frac{{\partial \ell ({\varvec{\upbeta}})}}{{\partial {\varvec{\upbeta}}}} = \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {{\mathbf{x}}_{i} - \frac{{\sum\nolimits_{{\ell \in R_{i} }} {{\mathbf{x}}_{\ell } \exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}} \right]} . $$

The second-order derivatives of \( \ell ({\varvec{\upbeta}}) \) constitute the Hessian matrix,

$$ H({\varvec{\upbeta}}) = \frac{{\partial^{2} \ell ({\varvec{\upbeta}})}}{{\partial {\varvec{\upbeta}}\partial {\varvec{\upbeta}}^{\prime } }} = - \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {\frac{{\sum\nolimits_{{\ell \in R_{i} }} {{\mathbf{x}}_{\ell } {\mathbf{x}}_{\ell }^{\prime } \exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }} - \frac{{\sum\nolimits_{{\ell \in R_{i} }} {{\mathbf{x}}_{\ell } \exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}\left\{ {\frac{{\sum\nolimits_{{\ell \in R_{i} }} {{\mathbf{x}}_{\ell } \exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }}} \right\}^{\prime } } \right]} . $$

Since \( H({\varvec{\upbeta}}) \) is a negative definite matrix (see Exercise 3 in Sect. 2.9), the log-partial likelihood \( \ell ({\varvec{\upbeta}}) \) is concave. This implies that \( \ell ({\varvec{\upbeta}}) \) has a unique maximum \( {\hat{\varvec{\upbeta}}} \) that solves \( {\mathbf{S}}({\varvec{\upbeta}}) = {\mathbf{0}} \).

Interval estimation for \( {\varvec{\upbeta}} \) is implemented by applying the asymptotic theory (Fleming and Harrington 1991). The information matrix is defined as \( i({\hat{\varvec{\upbeta}}}) = - H({\hat{\varvec{\upbeta}}}). \) The standard error (SE) of \( \hat{\beta }_{j} \) is \( SE(\hat{\beta }_{j} ) = \sqrt {\{ \;i^{ - 1} ({\hat{\varvec{\upbeta}}})\;\}_{jj} } \), which uses the j-th diagonal element of the inverse information matrix. The 95% confidence interval (CI) is \( \hat{\beta }_{j} \pm 1.96 \times SE(\hat{\beta }_{j} ) \).

To gain more insight into Cox regression, we consider a simple case where \( x_{i} \) denotes the gender, defined as \( x_{i} = 1 \) for male and \( x_{i} = 0 \) for female. In this setting, the Cox model is written as \( h(t|x_{i} ) = h_{0} (t)\exp (\beta x_{i} ) \), where the factor \( \exp (\beta ) \) represents the RR of males relative to females.

We shall demonstrate how the factor \( \exp (\beta ) \) is estimated by maximizing the log-partial likelihood in Eq. (2.2). We solve the score equation \( S(\beta ) = 0 \) where

$$ S(\beta ) = \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {x_{i} - \frac{{\sum\nolimits_{{\ell \in R_{i} }} {x_{\ell } \exp (\beta x_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp (\beta x_{\ell } )} }}} \right]} = \sum\limits_{i = 1}^{n} {\delta_{i} \frac{{x_{i} n_{i0} - (1 - x_{i} )n_{i1} \exp (\beta )}}{{n_{i0} + n_{i1} \exp (\beta )}}} . $$

Hence, the estimate of \( \exp (\beta ) \) needs to satisfy the equation

$$ \exp (\beta ) = \frac{{\sum\limits_{{i:x_{i} = 1}} {\delta_{i} \frac{{n_{i0} }}{{n_{i0} + n_{i1} \exp (\beta )}}} }}{{\sum\limits_{{i:x_{i} = 0}} {\delta_{i} \frac{{n_{i1} }}{{n_{i0} + n_{i1} \exp (\beta )}}} }}. $$
(2.3)

This is the expected number of deaths among males divided by the expected number of deaths among females, which agrees with the interpretation of \( \exp (\beta ) \).

Equation (2.3) can be solved by the fixed-point iteration algorithm. First, applying the initial value \( \exp (\beta ) = 1 \) to the right-hand side of Eq. (2.3), we have

$$ \exp (\beta ) = \frac{{\sum\limits_{{i:x_{i} = 1}} {\delta_{i} \frac{{n_{i0} }}{{n_{i0} + n_{i1} }}} }}{{\sum\limits_{{i:x_{i} = 0}} {\delta_{i} \frac{{n_{i1} }}{{n_{i0} + n_{i1} }}} }}. $$

We apply this value of \( \exp (\beta ) \) to the right-hand side of Eq. (2.3) to obtain an updated value of \( \exp (\beta ) \). This process is repeated until the updated value no longer changes from the previous step. While the fixed-point iteration gives insight into how \( \exp (\beta ) \) is estimated from the data, it may require a large number of iterations to converge.
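A minimal Python sketch of this fixed-point iteration, using the risk-set counts of Example 1 (Table 2.1):

```python
import math

# Risk-set counts (n_i0, n_i1) at the observed death times of Example 1,
# split by the gender of the patient who died
male_deaths = [(4, 5), (2, 4), (1, 2)]    # deaths with x_i = 1
female_deaths = [(5, 5), (4, 4), (3, 4)]  # deaths with x_i = 0

r = 1.0  # initial value exp(beta) = 1
for _ in range(200):
    num = sum(n0 / (n0 + n1 * r) for n0, n1 in male_deaths)
    den = sum(n1 / (n0 + n1 * r) for n0, n1 in female_deaths)
    r_new = num / den  # right-hand side of Eq. (2.3)
    if abs(r_new - r) < 1e-10:
        break
    r = r_new

beta_hat = math.log(r)  # approaches the value -0.3156 reported in Table 2.2
```

The first update reproduces the displayed starting value \( \exp (\beta ) = 1.1111/1.5714 \approx 0.707 \), and the subsequent updates oscillate toward the fixed point.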

A computationally faster algorithm is the Newton–Raphson algorithm, which utilizes the score function \( S(\beta ) = d\ell (\beta )/d\beta \) and the Hessian \( H(\beta ) = d^{2} \ell (\beta )/d\beta^{2} \). The algorithm starts with the initial value \( \beta^{(0)} = 0 \), and then follows the sequence

$$ \beta^{(k + 1)} = \beta^{(k)} - H^{ - 1} (\beta^{(k)} )S(\beta^{(k)} ),\quad k = 0,1, \ldots $$

The algorithm converges if \( |\beta^{(k + 1)} - \beta^{(k)} | \approx 0 \). Then, the estimate is \( \hat{\beta } = \beta^{(k)} \) and its standard error is \( SE(\hat{\beta }) = \sqrt { - H^{ - 1} (\hat{\beta })} \). The score function is

$$ S(\beta ) = \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {x_{i} - \frac{{\sum\nolimits_{{\ell \in R_{i} }} {x_{\ell } \exp (\beta x_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp (\beta x_{\ell } )} }}} \right]} = \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {x_{i} - \frac{{n_{i1} \exp (\beta )}}{{n_{i0} + n_{i1} \exp (\beta )}}} \right]} , $$

and the Hessian is

$$ H(\beta ) = - \sum\limits_{i = 1}^{n} {\delta_{i} \left[ {\frac{{\sum\nolimits_{{\ell \in R_{i} }} {x_{\ell }^{2} \exp (\beta x_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp (\beta x_{\ell } )} }} - \left\{ {\frac{{\sum\nolimits_{{\ell \in R_{i} }} {x_{\ell }^{{}} \exp (\beta x_{\ell } )} }}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp (\beta x_{\ell } )} }}} \right\}^{2} } \right]} = - \sum\limits_{i = 1}^{n} {\delta_{i} \frac{{n_{i0} n_{i1} \exp (\beta )}}{{\{ \;n_{i0} + n_{i1} \exp (\beta )\;\}^{2} }}} . $$

We use Example 1 to compare the convergence of the fixed-point iteration and Newton–Raphson algorithms. Table 2.2 shows that the Newton–Raphson algorithm converges faster than the fixed-point iteration. The two algorithms reach the same value \( \hat{\beta } = - 0.3156 \).

Table 2.2 Iteration algorithms to compute \( \hat{\beta } \) using the data of Example 1
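A minimal Python sketch of the Newton–Raphson algorithm for the single-covariate case, again using the risk-set counts of Example 1:

```python
import math

# (n_i0, n_i1, x_i) at each observed death time of Example 1:
# females at risk, males at risk, and the gender of the patient who died
deaths = [(5, 5, 0), (4, 5, 1), (4, 4, 0), (3, 4, 0), (2, 4, 1), (1, 2, 1)]

def score(beta):
    r = math.exp(beta)
    return sum(xi - n1 * r / (n0 + n1 * r) for n0, n1, xi in deaths)

def hessian(beta):
    r = math.exp(beta)
    return -sum(n0 * n1 * r / (n0 + n1 * r) ** 2 for n0, n1, _ in deaths)

beta = 0.0  # initial value beta^(0) = 0
for _ in range(25):
    step = score(beta) / hessian(beta)
    beta -= step  # beta^(k+1) = beta^(k) - H^{-1}(beta^(k)) S(beta^(k))
    if abs(step) < 1e-12:
        break

se = math.sqrt(-1.0 / hessian(beta))  # SE(beta_hat) = sqrt(-H^{-1}(beta_hat))
# The text reports beta_hat = -0.3156 with SE = 0.825
```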

The Wald test for the null hypothesis \( H_{0} :\beta = 0 \) is based on the Z-value \( z = \hat{\beta }/SE(\hat{\beta }) \). The P-value is computed as \( \Pr (\;|Z| > |z|\;) \), where \( Z \sim N(0,1) \).

The score test for the null hypothesis \( H_{0} :\beta = 0 \) uses the score statistic and its variance,

$$ S(0) = \sum\limits_{i = 1}^{n} {\delta_{i} \left( {x_{i} - \frac{{n_{i1} }}{{n_{i0} + n_{i1} }}} \right)} ,\quad Var\{ \;S(0)\;\} = - H(0) = \sum\limits_{i = 1}^{n} {\delta_{i} \frac{{n_{i1} n_{i0} }}{{(n_{i0} + n_{i1} )^{2} }}} . $$

The score test based on \( z = S(0)/\sqrt {Var\{ \;S(0)\;\} } \) is exactly the same as the log-rank test. This coincidence does not imply that the log-rank test relies on the Cox model assumption (Sect. 2.8).

The Newton–Raphson algorithm can also be applied to the multi-dimensional case (\( p \ge 2 \)) (see Sect. 2.7). The fixed-point iteration algorithm, however, may not be easily applied to the multi-dimensional case (see Exercise 4 in Sect. 2.9).

2.6 R Survival Package

We shall briefly introduce the R package survival for analyzing real data. After installing the package, we enter the survival times \( t_{i} \), censoring indicators \( \delta_{i} \), and covariates \( x_{i} \) for the \( n = 10 \) patients. Then, we run the code:

The output is shown below and in Fig. 2.1.

Fig. 2.1 Kaplan–Meier survival curve and the 95% CI calculated from the data of Example 1 (\( n = 10 \)). Censoring times are indicated by the mark “+”

The results of the log-rank test show \( S = 3 - 3.46 = - 0.46 \) with the chi-square statistic \( z^{2} = 0.148 \) and the P-value = 0.701 (see the row of “x = 1”). The results of the Cox regression show \( \hat{\beta } = - 0.316 \), \( RR = \exp (\hat{\beta }) = 0.729 \), \( SE(\hat{\beta }) = 0.825 \), and \( z = \hat{\beta }/SE(\hat{\beta }) = - 0.38 \). The P-value of the Wald test is 0.702. Hence, the log-rank test and the Wald test show similar results. In addition, the log-rank test and the score test yield identical results.

Since the difference between the two groups is not significant, we combine the two groups and then draw the Kaplan–Meier survival curve. Figure 2.1 displays the Kaplan–Meier survival curve and the 95% CI.

2.7 Likelihood-Based Inference

This section considers likelihood-based methods for analyzing the data \( (\;t_{i} ,\;\delta_{i} ,\;{\mathbf{x}}_{i} \;) \), \( i = 1, \ldots ,n \). Recall that we defined survival time \( T \) and censoring time \( U \) such that:

  • \( T = t_{i} \) and \( U > t_{i} \) if \( \delta_{i} = 1 \),

  • \( T > t_{i} \) and \( U = t_{i} \) if \( \delta_{i} = 0 \).

Combining these two events, the likelihood for the \( i \)-th patient is expressed as

$$ L_{i} = \Pr (\;T = t_{i} ,\;U > t_{i} |{\mathbf{x}}_{i} \;)^{{\delta_{i} }} \Pr (\;T > t_{i} ,\;U = t_{i} |{\mathbf{x}}_{i} \;)^{{1 - \delta_{i} }} . $$

Under the independent censoring assumption,

$$ \begin{aligned} L_{i} & = [\;\Pr (\;T = t_{i} |{\mathbf{x}}_{i} \;)\Pr (\;U > t_{i} |{\mathbf{x}}_{i} \;)\;]^{{\delta_{i} }} [\;\Pr (\;T > t_{i} |{\mathbf{x}}_{i} \;)\Pr (\;U = t_{i} |{\mathbf{x}}_{i} \;)\;]^{{1 - \delta_{i} }} \\ & = [\;f_{T} (t_{i} |{\mathbf{x}}_{i} )S_{U} (t_{i} |{\mathbf{x}}_{i} )\;]^{{\delta_{i} }} [\;S_{T} (t_{i} |{\mathbf{x}}_{i} )f_{U} (t_{i} |{\mathbf{x}}_{i} )\;]^{{1 - \delta_{i} }} \\ & = [\;f_{T} (t_{i} |{\mathbf{x}}_{i} )^{{\delta_{i} }} S_{T} (t_{i} |{\mathbf{x}}_{i} )^{{1 - \delta_{i} }} \;][\;f_{U} (t_{i} |{\mathbf{x}}_{i} )^{{1 - \delta_{i} }} S_{U} (t_{i} |{\mathbf{x}}_{i} )^{{\delta_{i} }} \;] \\ \end{aligned} $$

where \( S_{T} (t|{\mathbf{x}}_{i} ) = \Pr (\;T > t\;|\;{\mathbf{x}}_{i} \;) \), \( f_{T} (t|{\mathbf{x}}_{i} ) = - {\text{d}}S_{T} (t|{\mathbf{x}}_{i} )/{\text{d}}t \), \( S_{U} (t|{\mathbf{x}}_{i} ) = \Pr (\;U > t\;|\;{\mathbf{x}}_{i} \;) \), and \( f_{U} (t|{\mathbf{x}}_{i} ) = - {\text{d}}S_{U} (t|{\mathbf{x}}_{i} )/{\text{d}}t \). In addition to the independent censoring assumption, we further impose the following assumption:

Non-informative censoring assumption: The censoring distribution does not involve any parameters related to the distribution of the survival times. That is, \( S_{U} (t|{\mathbf{x}}_{i} ) \) does not contain parameters related to \( S_{T} (t|{\mathbf{x}}_{i} ) \).

Under the non-informative censoring assumption, the term \( f_{U} (t_{i} |{\mathbf{x}}_{i} )^{{1 - \delta_{i} }} S_{U} (t_{i} |{\mathbf{x}}_{i} )^{{\delta_{i} }} \) is unrelated to the likelihood for the survival times and can simply be ignored. Therefore, the likelihood function is re-defined as

$$ L = \prod\limits_{i = 1}^{n} {f_{T} (t_{i} |{\mathbf{x}}_{i} )^{{\delta_{i} }} S_{T} (t_{i} |{\mathbf{x}}_{i} )^{{1 - \delta_{i} }} } = \prod\limits_{i = 1}^{n} {h_{T} (t_{i} |{\mathbf{x}}_{i} )^{{\delta_{i} }} \exp [\; - H_{T} (t_{i} |{\mathbf{x}}_{i} )\;]} , $$

where \( h_{T} (t|{\mathbf{x}}_{i} ) = f_{T} (t|{\mathbf{x}}_{i} )/S_{T} (t|{\mathbf{x}}_{i} ) \) and \( H_{T} (t|{\mathbf{x}}_{i} ) = \int_{0}^{t} {h_{T} (u|{\mathbf{x}}_{i} ){\text{d}}u} \). The log-likelihood is

$$ \ell = \log L = \sum\limits_{i = 1}^{n} {[\;\delta_{i} \log h_{T} (t_{i} |{\mathbf{x}}_{i} ) - H_{T} (t_{i} |{\mathbf{x}}_{i} )\;]} . $$
(2.4)
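As an illustration of maximizing the log-likelihood (2.4), consider the simplest parametric case: an exponential model \( h_{T} (t) = \lambda \) without covariates, fitted to the data of Example 1 with the two groups combined. Here the MLE has a closed form; a Python sketch:

```python
import math

# Data of Example 1 with the two groups combined, as in Sect. 2.6
t = [1650, 30, 720, 450, 510, 1110, 210, 1380, 1800, 540]
d = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def log_likelihood(lam):
    """Log-likelihood (2.4) for the exponential model: h_T(t) = lam, H_T(t) = lam * t."""
    return sum(di * math.log(lam) - lam * ti for ti, di in zip(t, d))

# Setting d ell / d lam = D / lam - sum(t_i) = 0 gives the closed-form MLE,
# where D is the number of observed deaths
lam_hat = sum(d) / sum(t)  # 6 deaths over 8400 units of total follow-up

# lam_hat should dominate nearby parameter values
better = all(log_likelihood(lam_hat) >= log_likelihood(lam)
             for lam in (lam_hat * 0.5, lam_hat * 0.9, lam_hat * 1.1, lam_hat * 2.0))
```

For richer models without a closed form, the same log-likelihood is maximized numerically, as described next.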

Usually, censoring is non-informative if it is independent. Only an artificial or unusual example yields informative but independent censoring (p. 150 of Andersen et al. 1993; p. 196 of Kalbfleisch and Prentice 2002). It is well known that independent censoring is the more crucial assumption: violating non-informative censoring alone does not lead to bias in estimation. Throughout the book, we focus on dependent censoring rather than informative censoring.

If censoring is dependent, the likelihood for the \( i \)-th patient is

$$ \begin{aligned} L_{i} & = \Pr (\;T = t_{i} ,\;U > t_{i} |{\mathbf{x}}_{i} \;)^{{\delta_{i} }} \Pr (\;T > t_{i} ,\;U = t_{i} |{\mathbf{x}}_{i} \;)^{{1 - \delta_{i} }} \\ & = \left\{ {\left. { - \frac{\partial }{\partial x}\Pr (\;T > x,\;U > t_{i} |{\mathbf{x}}_{i} \;)} \right|_{{x = t_{i} }} } \right\}^{{\delta_{i} }} \left\{ {\left. { - \frac{\partial }{\partial y}\Pr (\;T > t_{i} ,\;U > y|{\mathbf{x}}_{i} \;)} \right|_{{y = t_{i} }} } \right\}^{{1 - \delta_{i} }} . \\ \end{aligned} $$

Therefore, the log-likelihood is defined as

$$ \ell = \sum\limits_{i = 1}^{n} {[\delta_{i} \log h_{T}^{\# } (t_{i} |{\mathbf{x}}_{i} ) + (1 - \delta_{i} )\log h_{U}^{\# } (t_{i} |{\mathbf{x}}_{i} )- {\Phi }(t_{i} ,\;t_{i} |{\mathbf{x}}_{i} )\;]} , $$

where

$$ \left. {h_{T}^{\# } (t_{i} |{\mathbf{x}}_{i} ) = - \frac{\partial }{\partial x}\log \,\Pr (\;T > x,\;U > t_{i} |{\mathbf{x}}_{i} \;)} \right|_{{x = t_{i} }} ,\quad \left. {h_{U}^{\# } (t_{i} |{\mathbf{x}}_{i} ) = - \frac{\partial }{\partial y}\log \,\Pr (\;T > t_{i} ,\;U > y|{\mathbf{x}}_{i} \;)} \right|_{{y = t_{i} }} , $$

are the cause-specific hazard functions, and

$$ {\Phi} (t_{i} ,\;t_{i} |{\mathbf{x}}_{i} ) = - \log \,\Pr (\;T > t_{i} ,\;U > t_{i} |{\mathbf{x}}_{i} \;) = - \log \,\Pr (\;\hbox{min} \{ \;T,\;U\;\} > t_{i} |{\mathbf{x}}_{i} \;) $$

is the cumulative hazard function for \( \hbox{min} \{ \;T,\;U\;\} \).
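As a numerical sanity check on these definitions, the following Python sketch verifies that under independence with exponential margins (hypothetical rates \( \lambda_{T} = 0.3 \), \( \lambda_{U} = 0.2 \)), the cause-specific hazard \( h_{T}^{\#} \), computed by a finite-difference approximation to \( - \partial \log \Pr (T > x,U > t)/\partial x \), reduces to the marginal hazard \( \lambda_{T} \).

```python
import numpy as np

# Under independence, Pr(T > x, U > t) = exp(-lam_T*x) * exp(-lam_U*t)
# (a hypothetical exponential choice), so the cause-specific hazard
# h_T^#(t) = -d/dx log Pr(T > x, U > t) |_{x=t} equals the marginal lam_T.
lam_T, lam_U, t0 = 0.3, 0.2, 1.5

def log_joint_surv(x, y):
    return -lam_T * x - lam_U * y

# central finite difference in the first argument at x = t0
eps = 1e-6
h_T_sharp = -(log_joint_surv(t0 + eps, t0) - log_joint_surv(t0 - eps, t0)) / (2 * eps)
print(h_T_sharp)  # close to lam_T = 0.3
```

Under dependent censoring, the same finite-difference recipe applies to any joint survival function, but \( h_{T}^{\#} \) then generally differs from the marginal hazard of \( T \).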

Suppose that the log-likelihood is parameterized by \( {\varvec{\upvarphi }}. \) Then, the maximum likelihood estimator (MLE) is defined by maximizing the log-likelihood, \( {\hat{\varvec{\upvarphi }}} = \arg \max_{{\varvec{\upvarphi }}} \ell (\;{\varvec{\upvarphi }}\;). \) To find the MLE numerically, one can use the score function \( {\mathbf{S}}(\;{\varvec{\upvarphi }}\;) = \partial \ell (\;{\varvec{\upvarphi }}\;)/\partial {\varvec{\upvarphi }} \) and the Hessian matrix \( H(\;{\varvec{\upvarphi }}\;) = \partial^{2} \ell (\;{\varvec{\upvarphi }}\;)/\partial {\varvec{\upvarphi }}\partial {\varvec{\upvarphi^{\prime}}}. \) The MLE \( {\hat{\varvec{\upvarphi }}} \) is obtained from the Newton–Raphson algorithm

$$ {\varvec{\upvarphi }}^{(k + 1)} = {\varvec{\upvarphi }}^{(k)} - H^{ - 1} (\;{\varvec{\upvarphi }}^{(k)} \;){\mathbf{S}}(\;{\varvec{\upvarphi }}^{(k)} \;),\qquad k = 0,1, \ldots $$

Interval estimates for \( {\varvec{\upvarphi }} \) follow from the asymptotic theory of MLEs. The information matrix is defined as \( i(\;{\hat{\varvec{\upvarphi }}}\;) = - H(\;{\hat{\varvec{\upvarphi }}}\;). \) The SE for \( \hat{\varphi }_{j} \) (the j-th component of \( {\hat{\varvec{\upvarphi }}} \)) is \( SE(\hat{\varphi }_{j} ) = \sqrt {\{ \;i^{ - 1} ({\hat{\varvec{\upvarphi }}})\;\}_{jj} } \), which uses the j-th diagonal element of the inverse information matrix. The 95% CI is \( \hat{\varphi }_{j} \pm 1.96 \times SE(\hat{\varphi }_{j} ). \)
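The following Python sketch illustrates the Newton–Raphson iteration, SE, and 95% CI in the simplest case: the one-parameter exponential model \( h(t) = \lambda \), where \( \ell (\lambda ) = m\log \lambda - \lambda \sum {t_{i} } \) and the MLE has the closed form \( m/\sum {t_{i} } \) against which the iteration can be checked. The data are hypothetical.

```python
import numpy as np

# Minimal sketch for h(t) = lam: loglik l(lam) = m*log(lam) - lam*sum(t),
# score S(lam) = m/lam - sum(t), Hessian H(lam) = -m/lam^2,
# information i(lam) = m/lam^2.
t = np.array([2.0, 3.0, 5.0, 7.0, 11.0])   # hypothetical data
delta = np.array([1, 0, 1, 1, 0])
m, total = delta.sum(), t.sum()

lam = 0.1  # starting value (Newton-Raphson needs a reasonable start)
for _ in range(50):
    score = m / lam - total
    hess = -m / lam**2
    lam = lam - score / hess       # phi^(k+1) = phi^(k) - H^{-1} S
se = lam / np.sqrt(m)              # SE from the inverse information
ci = (lam - 1.96 * se, lam + 1.96 * se)
print(lam, ci)                     # lam converges to m/total = 3/28
```

The same update rule carries over componentwise to vector parameters, with the Hessian inverted as a matrix.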

For instance, the Cox model takes the form \( {\varvec{\upvarphi }} = ({\varvec{\uptheta}},\;{\varvec{\upbeta}}) \) and \( h(t|{\mathbf{x}}_{i} ) = h_{0} (t;{\varvec{\uptheta}})\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{i} ) \), where \( {\varvec{\uptheta}} = (\theta_{1} ,\; \ldots ,\;\theta_{m} ) \) is a vector of parameters for the baseline hazard function. We assume that the baseline cumulative hazard function \( H_{0} (t;{\varvec{\uptheta}}) \) is an increasing step function whose jump at the j-th death time equals \( {\text{d}}H_{0} (t_{j} ;{\varvec{\uptheta}}) = e^{{\theta_{j} }} \). Hence, the number of parameters in \( {\varvec{\uptheta}} \) is equal to the number of deaths \( m = \sum\nolimits_{i = 1}^{n} {\delta_{i} } \). The MLE \( {\hat{\varvec{\upvarphi }}} = ({\hat{\varvec{\uptheta}}},\;{\hat{\varvec{\upbeta}}}) \) is obtained from the Newton–Raphson algorithm. It has been shown that \( {\hat{\varvec{\upbeta}}} \) is equivalent to the partial likelihood estimator and \( {\hat{\varvec{\uptheta}}} \) yields the Breslow estimator \( {\text{d}}H_{0} (t_{j} ;{\hat{\varvec{\uptheta}}}) = e^{{\hat{\theta }_{j} }} = \left( {\sum\nolimits_{{t_{\ell } \ge t_{j} }} {e^{{{\hat{\varvec{\upbeta}}}^{\prime } {\mathbf{x}}_{\ell } }} } } \right)^{ - 1} \) (van der Vaart 1998; van Houwelingen and Putter 2011).
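Given an estimate \( {\hat{\varvec{\upbeta}}} \), the Breslow jumps can be computed directly from the risk sets. A Python sketch with hypothetical data and a hypothetical value of \( \hat{\beta } \):

```python
import numpy as np

# Breslow estimator sketch: given beta_hat, the jump of the baseline cumulative
# hazard at each death time t_j is 1 / (sum over the risk set of exp(beta_hat*x)).
t = np.array([2.0, 3.0, 5.0, 7.0])        # hypothetical data
delta = np.array([1, 1, 0, 1])
x = np.array([1.0, 0.0, 1.0, 0.0])
beta_hat = 0.4                            # hypothetical estimate

risk = np.exp(beta_hat * x)               # relative risk scores
jumps = {}
for j in np.where(delta == 1)[0]:         # loop over death times
    jumps[float(t[j])] = 1.0 / risk[t >= t[j]].sum()
print(jumps)
```

Note that the last death, whose risk set contains only itself, gets jump \( 1/e^{{\hat{\varvec{\upbeta}}}^{\prime } {\mathbf{x}}_{\ell } } \); here \( x_{\ell } = 0 \), so the jump is 1.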

2.8 Technical Notes

Readers may skip this section, as it does not affect the understanding of the later chapters of the book.

The log-rank test possesses an easy-to-understand optimality criterion: it is asymptotically efficient (most powerful) for detecting a constant hazard ratio \( h(t|x_{i} = 1)/h(t|x_{i} = 0) = \psi \) for some \( \psi \ne 1 \). Any reasonable test, such as the t-test, is optimal against some specific form of alternative. For details on the asymptotic efficiency, see Andersen et al. (1993) and Fleming and Harrington (1991).

If the form of \( h(t|x_{i} = 1)/h(t|x_{i} = 0) \) is non-constant, then the log-rank test may be sub-optimal. For example, Gehan’s generalized Wilcoxon test statistic (Gehan 1965) defined as

$$ S = \sum\limits_{i = 1}^{n} {\delta_{i} (n_{i0} + n_{i1} )\left( {x_{i} - \frac{{n_{i1} }}{{n_{i0} + n_{i1} }}} \right)} $$

can be more powerful than the log-rank statistic if the ratio \( h(t|x_{i} = 1)/h(t|x_{i} = 0) \) strongly deviates from 1 in the early stage of follow-up. The generalized Wilcoxon test statistic is a special case of the weighted log-rank statistics (Fleming and Harrington 1991; Klein and Moeschberger 2003). If there is a concern about the non-constant hazard ratio, the weighted log-rank statistics may be employed.
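A minimal Python sketch of Gehan's statistic with hypothetical data; here \( n_{i0} \) and \( n_{i1} \) are the at-risk counts in each group at \( t_{i} \), computed directly from the data.

```python
import numpy as np

# Gehan's generalized Wilcoxon statistic:
# S = sum over deaths of (n_i0 + n_i1) * (x_i - n_i1/(n_i0 + n_i1)),
# where n_ik is the number at risk in group k at time t_i.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
delta = np.array([1, 1, 0, 1, 1])
x = np.array([1, 0, 1, 0, 1])

S = 0.0
for i in range(len(t)):
    if delta[i] == 1:
        n0 = np.sum((t >= t[i]) & (x == 0))   # at risk, group 0
        n1 = np.sum((t >= t[i]) & (x == 1))   # at risk, group 1
        S += (n0 + n1) * (x[i] - n1 / (n0 + n1))
print(S)
```

Dropping the weight \( (n_{i0} + n_{i1} ) \) recovers the log-rank numerator, which makes the connection to the weighted log-rank family explicit.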

A gross misunderstanding is that the log-rank test is tailored to detect an effect only under the proportional hazards assumption. As mentioned earlier, the log-rank statistic is a non-parametric test that detects excess mortality without any model assumption.

We have derived the Kaplan–Meier estimator and the log-rank test under the assumption that all times to death are distinct (no ties). To handle ties, it is useful to introduce counting process formulations (Andersen et al. 1993; Fleming and Harrington 1991). For \( k = 0,\;1 \), let \( \bar{Y}_{k} (t) = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}\{ \;t_{\ell } \ge t,\;x_{\ell } = k\;\} } \) be the number at risk at time \( t \), and let \( \bar{N}_{k} (t) = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}\{ \;t_{\ell } \le t,\;\delta_{\ell } = 1,\;x_{\ell } = k\;\} } \) be the number of deaths up to time \( t \). Then, at time \( t \), the number of deaths among males is \( d\bar{N}_{1} (t) = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}\{ \;t_{\ell } = t,\;\delta_{\ell } = 1,\;x_{\ell } = 1\;\} } \), and the total number of deaths is \( d\bar{N}(t) = \sum\nolimits_{\ell = 1}^{n} {{\mathbf{I}}\{ \;t_{\ell } = t,\;\delta_{\ell } = 1\;\} } \).

The Kaplan–Meier estimator for the group \( k \) is defined as

$$ \hat{S}_{k} (t) = \prod\limits_{u \le t} {\{ \;1 - {\text{d}}\hat{H}_{k} (u)\;\} } ,\quad k = 0,\;1, $$

where \( {\text{d}}\hat{H}_{k} (t) = {\text{d}}\bar{N}_{k} (t)/\bar{Y}_{k} (t) \) is called the Nelson–Aalen estimator.
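A Python sketch of the Kaplan–Meier estimator in this counting-process form, using hypothetical data that include tied death times:

```python
import numpy as np

# Kaplan-Meier via the counting process form (handles ties): at each distinct
# death time u, dH(u) = dN(u)/Y(u), and S(t) = prod_{u <= t} (1 - dH(u)).
t = np.array([2.0, 2.0, 3.0, 5.0, 5.0, 7.0])   # hypothetical data with ties
delta = np.array([1, 1, 0, 1, 1, 0])

S = 1.0
for u in np.unique(t[delta == 1]):             # distinct death times
    Y = np.sum(t >= u)                         # number at risk
    dN = np.sum((t == u) & (delta == 1))       # deaths at u (ties allowed)
    S *= 1 - dN / Y                            # Kaplan-Meier product step
    print(u, S)
```

With two deaths at \( t = 2 \) among six at risk and two deaths at \( t = 5 \) among three at risk, the survival estimate drops to \( 2/3 \) and then to \( 2/9 \).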

The conditional distribution of \( {\text{d}}\bar{N}_{1} (t) \) given \( (\;{\text{d}}\bar{N}(t),\;\bar{Y}_{0} (t),\;\bar{Y}_{1} (t)\;) \) is a hypergeometric distribution with mean

$$ E\{ \;{\text{d}}\bar{N}_{1} (t)|{\text{d}}\bar{N}(t),\;\bar{Y}_{0} (t),\;\bar{Y}_{1} (t)\;\} = \frac{{{\text{d}}\bar{N}(t)\bar{Y}_{1} (t)}}{{\bar{Y}_{0} (t) + \;\bar{Y}_{1} (t)}}. $$

Consequently, the aggregated difference between the observed and expected numbers of deaths is

$$ S = \int\limits_{0}^{\infty } {\;\left[ {{\text{d}}\bar{N}_{1} (t) - \frac{{{\text{d}}\bar{N}(t)\bar{Y}_{1} (t)}}{{\bar{Y}_{0} (t) + \;\bar{Y}_{1} (t)}}} \right] = } \int\limits_{0}^{\infty } {\;{\text{d}}\bar{N}_{1} (t)} - \int\limits_{0}^{\infty } {\;\frac{{{\text{d}}\bar{N}(t)\bar{Y}_{1} (t)}}{{\bar{Y}_{0} (t) + \;\bar{Y}_{1} (t)}}} . $$
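A Python sketch of this observed-minus-expected sum with hypothetical data; the integral reduces to a sum over the distinct death times.

```python
import numpy as np

# Log-rank numerator in counting-process form: at each death time u,
# add dN1(u) - dN(u)*Y1(u)/(Y0(u)+Y1(u)) (observed minus expected in group 1).
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
delta = np.array([1, 1, 0, 1, 1])
x = np.array([1, 0, 1, 0, 1])

S = 0.0
for u in np.unique(t[delta == 1]):
    Y0 = np.sum((t >= u) & (x == 0))                    # at risk, group 0
    Y1 = np.sum((t >= u) & (x == 1))                    # at risk, group 1
    dN = np.sum((t == u) & (delta == 1))                # total deaths at u
    dN1 = np.sum((t == u) & (delta == 1) & (x == 1))    # group-1 deaths at u
    S += dN1 - dN * Y1 / (Y0 + Y1)
print(S)
```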

The univariate partial likelihood estimator as derived in Eq. (2.3) has a counting process form

$$ \exp (\hat{\beta }) = \frac{{\int_{0}^{\infty } {W(t;\hat{\beta }){\text{d}}\hat{H}_{1} (t)} }}{{\int_{0}^{\infty } {W(t;\hat{\beta }){\text{d}}\hat{H}_{0} (t)} }},\quad W(t;\beta ) = \frac{{\bar{Y}_{0} (t)\bar{Y}_{1} (t)}}{{\bar{Y}_{0} (t) + \;\bar{Y}_{1} (t)\exp (\beta )}}. $$

This means that the estimator is the ratio of the expected number of deaths among males to the expected number of deaths among females. This way of interpreting the univariate estimator was suggested in Emura and Chen (2016) to argue for the robustness of the estimator against model misspecification. Under the independent censoring assumption, \( \hat{\beta } \) is a consistent estimator of \( \beta^{*} \), the solution to

$$ \exp (\beta ) = \frac{{\int_{0}^{\infty } {w(t;\beta )h(t|x = 1){\text{d}}t} }}{{\int_{0}^{\infty } {w(t;\beta )h(t|x = 0){\text{d}}t} }},\quad w(t;\beta ) = \frac{{\pi_{0} (t)\pi_{1} (t)}}{{\pi_{0} (t) + \;\pi_{1} (t)\exp (\beta )}}, $$

where \( \pi_{k} (t) = \lim_{n \to \infty } \bar{Y}_{k} (t)/n \) and the integral is over the range of \( t \) with \( \pi_{0} (t)\pi_{1} (t) > 0 \). If the proportional hazards model \( h(t|x_{i} = 1) = \exp (\beta_{0} )h(t|x_{i} = 0) \) holds for some \( \beta_{0} \), then \( \beta^{*} = \beta_{0} \). Even if the proportional hazards model does not hold, \( \beta^{*} \) remains meaningful since \( \exp (\beta^{*} ) \) is interpreted as the RR. However, the interpretation of the partial likelihood estimator may not be robust against violation of the independent censoring assumption (Chap. 3).
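The counting-process form of the estimating equation suggests a simple fixed-point scheme: update \( \exp (\beta ) \) by the ratio of the weighted integrals until convergence. The Python sketch below uses hypothetical data, and the iteration itself is an illustrative choice, not a prescribed algorithm from the references.

```python
import numpy as np

# Fixed-point sketch: exp(beta) <- [sum_u W(u;beta) dH1(u)] / [sum_u W(u;beta) dH0(u)],
# where dHk(u) = dNk(u)/Yk(u) at death times u (Nelson-Aalen increments).
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
delta = np.array([1, 1, 0, 1, 1])
x = np.array([1, 0, 1, 0, 1])

beta = 0.0
for _ in range(100):
    num = den = 0.0
    for u in np.unique(t[delta == 1]):
        Y0 = np.sum((t >= u) & (x == 0))
        Y1 = np.sum((t >= u) & (x == 1))
        W = Y0 * Y1 / (Y0 + Y1 * np.exp(beta))
        if Y1 > 0:
            num += W * np.sum((t == u) & (delta == 1) & (x == 1)) / Y1
        if Y0 > 0:
            den += W * np.sum((t == u) & (delta == 1) & (x == 0)) / Y0
    beta = np.log(num / den)
print(np.exp(beta))
```

For these data the fixed point satisfies \( 3r^{2} + r - 1 = 0 \) with \( r = e^{\beta } \), which also solves the partial likelihood score equation, so the iteration reproduces the partial likelihood estimate here.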

2.9 Exercises

  1.

    Deriving the Kaplan–Meier estimator: Consider a survival function \( S(\;t\;) = \Pr (\;T > t\;) \) that is a decreasing step function with steps at observed times of death. Assume that all the observed times to death are distinct (\( t_{i} \ne t_{j} \) whenever \( i \ne j \) and \( \delta_{i} = \delta_{j} = 1 \)).

  (1)

    Show \( \Pr (\;T > t_{i} \;) = \Pr (\;T > t_{i} |T > t_{i - 1} \;)\Pr (T > t_{i - 1} \;) \) if \( t_{i} \; > t_{i - 1} \).

  (2)

    Show \( \Pr (\;T > t_{j} \;) = \prod\limits_{i = 1}^{j} {\Pr (\;T > t_{i} |T > t_{i - 1} \;)} \) if \( t_{j} \; > t_{j - 1} > \cdots > t_{1} > t_{0} \equiv 0 \) and \( S(\;0\;) = 1 \).

  (3)

    Show \( \Pr (\;T > t_{i} |T > t_{i - 1} \;) = 1 - \Pr (\;T = t_{i} |T \ge t_{i} \;) \) if there is no death in the interval \( (t_{i - 1} ,\;t_{i} ) \).

  (4)

    Show \( S(\;t_{j} \;) = \prod\limits_{i = 1}^{j} {\left( {1 - \frac{{\Pr (T = t_{i} )}}{{\Pr (T \ge t_{i} )}}} \right)} \).

  2.

    Weibull regression: Let \( \log (T_{i} ) = \alpha_{0} + {\mathbf{\alpha^{\prime}x}}_{i} + \sigma \varepsilon_{i} \), where \( \Pr (\varepsilon_{i} > x) = \exp ( - e^{x} ) \) for \( - \infty < x < \infty \).

  (1)

    Derive the survival function \( S(t|{\mathbf{x}}_{i} ) \) and the hazard function \( h(t|{\mathbf{x}}_{i} ) \).

  (2)

    Show that the model can be expressed as \( h(t|{\mathbf{x}}_{i} ) = h_{0} (t)\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{i} ) \).

  (3)

    Show \( \Pr (T > t + w|T > t,\;{\mathbf{x}}_{i} ) < \Pr (T > w|{\mathbf{x}}_{i} ) \) for \( 0 < \sigma < 1 \) and \( w > 0 \). What does this inequality imply?

  3.

    Consider a discrete random vector \( {\mathbf{X}}_{i} = (X_{i1} , \ldots ,X_{ip} ) \) whose distribution is given by

$$ \Pr ({\mathbf{X}}_{i} = {\mathbf{x}}_{k} ) = \frac{{\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{k} )}}{{\sum\nolimits_{{\ell \in R_{i} }} {\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}_{\ell } )} }},\quad k \in R_{i} = \{ \ell :\;t_{\ell } \ge t_{i} \;\} ,\quad i = 1, \ldots ,n. $$

This represents the risk of the k-th patient relative to the total risk for those who are at risk of death at time \( t_{i} \). By assuming the independence of the sequence \( {\mathbf{X}}_{i} \), \( i = 1, \ldots ,n \), one can obtain the partial likelihood function \( L({\varvec{\upbeta}}) = \prod\nolimits_{i = 1}^{n} {\Pr ({\mathbf{X}}_{i} = {\mathbf{x}}_{i} )^{{\delta_{i} }} } \).

  (1)

    Express the score function \( {\mathbf{S}}({\varvec{\upbeta}}) \) using \( E({\mathbf{X}}_{i} ) \).

  (2)

    Express the Hessian matrix \( H({\varvec{\upbeta}}) \) using \( Var({\mathbf{X}}_{i} ) \).

  (3)

    Discuss the conditions to make \( H({\varvec{\upbeta}}) \) negative definite.

  4.

    Suppose that data \( (\;t_{i} ,\;\delta_{i} ,\;x_{i} \;) \), \( i = 1, \ldots ,n \), follow the model \( S(t|x_{i} ) = \exp (\; - \lambda te^{{\beta x_{i} }} \;) \), where \( \lambda > 0 \) and \( - \infty < \beta < \infty \). Let \( m = \sum\nolimits_{i = 1}^{n} {\delta_{i} } \) be the number of deaths.

  (1)

    Write down the log-likelihood function \( \ell (\lambda ,\;\beta ) = \log L(\lambda ,\;\beta ) \).

  (2)

    Derive the score functions \( \partial \ell (\lambda ,\;\beta )/\partial \lambda \) and \( \partial \ell (\lambda ,\;\beta )/\partial \beta \).

  (3)

    Derive the fixed-point iteration algorithm and apply it to the data of Example 1.

  (4)

    Derive the Hessian matrix of \( \ell (\lambda ,\;\beta ) \).

  (5)

    Derive the Newton–Raphson algorithm and apply it to the data of Example 1.

  (6)

    Derive the Newton–Raphson algorithm under the transformed parameter \( \tilde{\lambda } = \log (\lambda ) \) and apply it to the data of Example 1.

  (7)

    Compare the numbers of iterations across the three algorithms.

  5.

    Use the lung cancer data available in the compound.Cox R package (Emura et al. 2018) to:

  (1)

    Perform univariate Cox regression treating the ZNF264 gene or the NF1 gene as a covariate. Are these genes univariately associated with survival?

  (2)

    Perform multivariate Cox regression treating both the ZNF264 and NF1 genes as covariates. Are these genes associated with survival?

  (3)

    Discuss the influence of multicollinearity between ZNF264 and NF1.