
2.1 Introduction

Cancer remains the main health challenge we are facing nowadays. It seems that every day you find out that another person you know has been diagnosed with cancer. After the initial shock, you feel sad for that person, and guilty because you are somehow relieved that this did not happen to you or your loved ones. Even though heart disease remains the leading cause of death worldwide, its death rate is falling rapidly. Why is that? Because we found out what causes heart disease: high blood pressure, excess weight, high glucose levels, smoking, drinking, etc. Knowing the causes, we have the means to prevent it. Unfortunately, this is not the case for cancer. We still do not know what causes cancer.

Sadly enough, nowadays most people Google their symptoms first, before speaking to a physician. No matter what we are Googling, cancer will appear on the results page. So, we are alarmed, and while waiting for the doctor’s appointment we start Googling that type of cancer. The most searched questions are: what is the 5-year survival rate? What is the overall survival? If there is need for surgery, what is the morbidity? What about the mortality rate? So, the number one outcome of interest is survival: overall, disease-free, recurrence-free, and post-surgery survival.

The global cancer prevalence rose from 0.54% to 0.64% since 1990 [1]. Even if the cancer rates are rising, the death rates are falling. This can only mean one thing: earlier diagnosis and/or better novel treatments, hence people survive longer. The 5-year survival rate for all cancers has increased from 50.3% to 67% [2,3,4]; for prostate cancer, for instance, it rose from 67.8% to 98.6%. Table 2.1 and Fig. 2.1 present how the 5-year survival rates for different types of cancers have changed from 1970–1977 to 2007–2013 in the USA.

Table 2.1 5-year cancer survival rates in the USA comparison
Fig. 2.1
A double bar graph plots survival rates versus types of cancers. The survival rates of thyroid cancer were high and pancreas cancer was low between the years 2007 and 2013. The rate of breast cancer was high and pancreas cancer was low between the years 1970 and 1977.

5-year cancer survival rates comparison

WHO’s global target (25 × 25) is a 25% reduction in deaths from cancer in people aged 30–69 years by 2025 [5]. Cancer survival research is crucial for developing cancer control strategies [6] and control measures [7], and for assessing their effectiveness and costs [8, 9].

Survival analysis concerns the time until a certain event takes place: e.g. the time that passes from the start of chemotherapy until the tumor stops shrinking (the patient stops responding to treatment), the time elapsed from the end of a cancer surgery until the patient leaves the ICU (Intensive Care Unit), or the time that passes from the moment the patient starts radiotherapy until she/he passes away.

In this chapter, we are going to present statistical models that are used in survival analysis, along with examples and explanations regarding the obtained results.

2.2 Survival Analysis

Survival analysis deals with survival times. In order to start such an analysis you need two variables: a numeric measure of time (e.g. number of days, weeks, or months), and a categorical variable that identifies the event (e.g. no response to treatment, death). Together, the two variables tell us when a subject entered or left the study, and if and when the subject met a certain criterion.

Unfortunately, in practice things are never that simple, and we might find ourselves in two situations:

  • The start time cannot be specified. For example, we cannot establish exactly when the onset of the disease took place. Some cancers progress at a faster pace than others, being more aggressive, but there is no precise method to determine the onset of cancer.

  • The end time is difficult to determine. If the end time is given by the time of death, then there is no problem in establishing it, but if a person decides to leave the study, or she/he survives longer than the period that was set to be recorded (e.g. 5 years for the 5-year survival rate), things change. This is the case of the censored survival time.

In Fig. 2.2 we present an example of how we can record the survival times. The figure is converted into a table, Table 2.2.

Fig. 2.2
A graph plots patient numbers versus the timeline in months. The square indicates the patient has died and the circle indicates the patient is alive. Patient 1: 3. Patient 2: 9. Patient 3: 5. Patient 4: 9. Patient 5: 8. Patient 6: 5. Patient 7: 5. Patient 8: 8. Patient 9: 9.

Survival time recordings

Table 2.2 Tabulated survival time recordings

The dotted line from Fig. 2.2 gives us information about the study: the patients were recruited during the first 5 months of the study. The timeline from month 5 till month 9 represents the follow-up part of the study. The black square signifies that the patient has died, whereas the grey circle means that the patient did not die. To be more specific: patient 1 was recruited at the beginning of the study and stayed in the study till month 3, at which time she/he died. The second patient enrolled in the study at month 0 and did not have any event (i.e. death) during the whole observation period, so she/he represents a censored observation. Patient 3 also started at the beginning of the study, but died five months later. The fourth patient enrolled in month 1 and stayed until the end of the study. The fifth patient enrolled in month 1 and had an event (i.e. death) in month 8. The sixth patient enrolled in month 3, but left the study in month 5. Patient 7 joined the study in month 4 and passed away in month 6. Both patients 8 and 9 joined in month 5, the first leaving the study in month 8 (censored), while the latter in month 9 (censored).

You can see in Table 2.2 that the censored data are marked with (*).
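To make the recording scheme concrete, the following minimal Python sketch turns the enrollment and exit months narrated above into (duration, event) pairs, marking censored observations with (*). The variable and function names are our own, and we assume that each survival time is measured from the patient's own entry month; whether Table 2.2 lists durations or calendar months cannot be recovered here.

```python
# Entry and exit months follow the narrative description of Fig. 2.2.
# died=True marks a death; died=False marks a censored observation.
records = [
    # (patient, entry_month, exit_month, died)
    (1, 0, 3, True),
    (2, 0, 9, False),
    (3, 0, 5, True),
    (4, 1, 9, False),
    (5, 1, 8, True),
    (6, 3, 5, False),
    (7, 4, 6, True),
    (8, 5, 8, False),
    (9, 5, 9, False),
]


def to_survival_pairs(rows):
    """Return (duration, died) pairs, duration counted from each patient's entry."""
    return [(exit_m - entry_m, died) for _, entry_m, exit_m, died in rows]


pairs = to_survival_pairs(records)
for (patient, *_), (duration, died) in zip(records, pairs):
    # Censored durations carry a trailing asterisk, as in the table convention.
    label = str(duration) if died else f"{duration}*"
    print(f"patient {patient}: {label}")
```

These two columns, a duration and an event flag, are all that the estimators discussed in the following sections need.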

In general, we use survival analysis in oncology to review the outcomes of clinical trials, cohort studies, etc. For instance, if we have a cohort of 30 patients who were diagnosed with lung cancer between 2017 and 2019, started chemotherapy and/or immunotherapy, and were observed until the end of 2021, we wish to review their survival time. So far, we have seen that survival analysis contains a starting period, in which the patients are enrolled, followed by the observation period or follow-up, when the patients are observed. Besides these two stages, there exists another one named the final period. In this stage the collected data is analyzed, and conclusions are drawn.

Let us presume that, of the 30 lung cancer patients, 6 left the study during the observation period. These 6 patients will be excluded from the statistical analysis process. Of the remaining 24 patients, 14 survived and 10 passed away. This observation is depicted in a tree structure diagram, such as the one presented in Fig. 2.3.

Fig. 2.3
A diagram of survival analysis. Out of 24 patients with lung cancer, 14 patients survived and 10 patients died.

Survival analysis tree structure diagram

A more thorough tree diagram will include even the patients that left the study. See Fig. 2.4.

Fig. 2.4
A diagram of survival analysis. Out of 30 patients with lung cancer, 14 patients survived, 10 patients died, and six patients were censored.

Survival analysis tree diagram 2

We can compute the death rate or the death risk using the following formula:

$$ \text{death rate} = \frac{\text{number of deaths}}{\text{number of subjects}}. $$

In our example, with 10 deaths among the 24 patients who remained in the study, the death rate is 10/24 ≈ 0.42, that is, about 42%.

The death probability is computed as:

$$ \text{death probability} = \frac{D}{N - 0.5 \times L}, $$

where N is the cohort size, D is the number of deaths, and L is the number of patients that left the study during the observation period. The death probability in our case is 0.37, that is 37%.
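The two formulas can be checked with a few lines of Python. This is only a sketch of the arithmetic for the 30-patient example; the function names are our own, and we assume the death rate is taken over the 24 patients who remained in the study.

```python
def death_rate(deaths, subjects):
    """Number of deaths divided by the number of subjects analysed."""
    return deaths / subjects


def death_probability(deaths, cohort_size, left_study):
    """D / (N - 0.5 * L): patients who left count as half a subject."""
    return deaths / (cohort_size - 0.5 * left_study)


N, L, D = 30, 6, 10                    # cohort size, patients who left, deaths
rate = death_rate(D, N - L)            # 10 / 24
prob = death_probability(D, N, L)      # 10 / 27

print(f"death rate        = {rate:.4f}")
print(f"death probability = {prob:.4f}")
```

The death probability is lower than the death rate because each patient who left the study still contributes half a subject to the denominator.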

Having computed the death risk, we can compute the survival probability as 1 − death probability for that interval. By plotting the cumulative survival probability, we obtain the survival curve. The curve starts at 1, meaning that all patients are alive, and approaches 0 as patients start to die. In the following sections we shall discuss more about survival curves, starting with the Kaplan–Meier survival curves.

2.3 Kaplan–Meier Survival Curve

Kaplan–Meier curves were introduced in 1958 by Edward L. Kaplan and Paul Meier, and they can be used even if the data is incomplete, that is, censored [10]. They represent the standard for reporting the survival rate of patients, being used in over 70% of the oncology papers [11].

Kaplan–Meier curves use three types of data regarding the patient: the date the patient entered the study, the last date of observation (i.e. the last time the patient was seen alive), and whether the last observation was due to the death of the patient, or because the patient left the study. We can use Kaplan–Meier curves to determine the survival probability of a patient given certain conditions. For example, by recording the survival times of patients that undergo chemotherapy, we can compute the probability that a new patient survives a certain period of time if she/he undergoes the same protocol.

We denote the survival time with a random variable X. \(P_{n}\) is the probability that a patient survives the nth day after the last chemotherapy session, conditioned on the fact that she/he survived all the \(n - 1\) days before that. \(\overline{{P_{n} }}\) is the total probability of surviving all the n days, and we compute it as follows:

$$ \overline{{P_{n} }} = P_{1} \cdot P_{2} \cdot \ldots \cdot P_{n} $$

We compute the intermediate survival probabilities using:

$$ p_{k } = p_{k - 1} \times \frac{{r_{k} - f_{k} }}{{r_{k} }}, $$

where \(p_{k}\) is the probability of surviving k units of time, \(r_{k}\) is the number of patients still at risk of death at moment k (those who are alive and still in the study just before that moment), and \(f_{k}\) is the number of deaths reported at moment k. If no patient has died, the survival rate is 100%. We compute the standard error of the probability of surviving using:

$$ SE_{p_{k}} = p_{k} \cdot \sqrt{\frac{1 - p_{k}}{r_{k}}}. $$

If we presume that \(p_{k}\) is governed by the Normal distribution, then we can compute the 95% confidence interval as follows:

$$ \left( {p_{k} - 1.96 \times SE_{{p_{k} }} , p_{k} + 1.96 \times SE_{{p_{k} }} } \right) $$
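As an illustration of the product-limit formula, the standard error, and the 95% interval above, a minimal Python sketch follows. The function name `kaplan_meier`, the toy durations, and the clamping of the interval to [0, 1] are our own assumptions; the data does not come from the tables of this chapter.

```python
import math


def kaplan_meier(durations, events):
    """Product-limit estimate of the survival probability.

    Returns one row per distinct death time:
    (time, r_k, f_k, p_k, se_k, ci_low, ci_high).
    """
    data = list(zip(durations, events))
    rows = []
    p = 1.0
    for t in sorted({d for d, died in data if died}):
        r = sum(1 for d, _ in data if d >= t)              # at risk just before t
        f = sum(1 for d, died in data if d == t and died)  # deaths at t
        p *= (r - f) / r                                   # p_k = p_{k-1} * (r_k - f_k) / r_k
        se = p * math.sqrt((1.0 - p) / r)                  # standard error from the text
        low = max(0.0, p - 1.96 * se)                      # clamped: a probability stays in [0, 1]
        high = min(1.0, p + 1.96 * se)
        rows.append((t, r, f, p, se, low, high))
    return rows


# Hypothetical sample: durations in months, 1 = death, 0 = censored.
durations = [2, 3, 3, 5, 6, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]

rows = kaplan_meier(durations, events)
for t, r, f, p, se, low, high in rows:
    print(f"t={t}: at risk={r}, deaths={f}, p={p:.3f}, SE={se:.3f}, CI=({low:.3f}, {high:.3f})")
```

Censored patients never trigger a step in the curve, but they do reduce the number at risk at later death times, which is how incomplete observations still contribute to the estimate.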

The standard error does not always give accurate approximations if there are extreme values in the data sample. If this is the case, the Greenwood formula is preferred:

$$ SE_{p_{k}} = p_{k} \cdot \sqrt{\sum_{j = 1}^{k} \frac{f_{j}}{r_{j} \cdot \left( r_{j} - f_{j} \right)}}. $$
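The Greenwood formula accumulates one term per death time, which is easy to see in code. The sketch below is only an illustration under our own assumptions: the function name `greenwood` and the numbers at risk and deaths are hypothetical, and the case \(r_{j} = f_{j}\) (survival drops to 0) is handled by returning a standard error of 0.

```python
import math


def greenwood(at_risk, deaths):
    """Return (p_k, SE_k) for each step k, using the Greenwood formula."""
    rows = []
    p = 1.0
    running_sum = 0.0
    for r, f in zip(at_risk, deaths):
        p *= (r - f) / r
        if r == f:
            # Everyone at risk died: p_k is 0 and the Greenwood term is undefined.
            rows.append((p, 0.0))
            continue
        running_sum += f / (r * (r - f))   # one term of the sum over j = 1..k
        rows.append((p, p * math.sqrt(running_sum)))
    return rows


# Hypothetical life table: r_j patients at risk and f_j deaths at each death time.
rows = greenwood(at_risk=[8, 7, 5], deaths=[1, 1, 1])
for k, (p, se) in enumerate(rows, start=1):
    print(f"k={k}: p_k={p:.3f}, SE={se:.4f}")
```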

Let us presume that we have a data sample of 16 patients that have been diagnosed with stage IV lung cancer. All the patients have undergone chemotherapy treatment with a certain type of drug, drug A. The patients are monitored for 14 months. We start the survival analysis from day 0 (Table 2.3).

Table 2.3 Life table for lung cancer patients that underwent chemotherapy with drug A

Using the survival probability formula, we compute the survival probability at a given time. Table 2.4 presents these calculations.

Table 2.4 Survival probabilities, standard error and confidence interval

The corresponding Kaplan–Meier curve is plotted in Fig. 2.5.

Fig. 2.5
A graph plots cumulative survival versus time. The line for A fluctuates between 0 and 1 in survival with an increase in time between 0 and 10. The survival function time is indicated between 5 and 10.

Kaplan–Meier survival curve

Kaplan–Meier survival curves have a drawback: if we wish to compare two or more data samples, we can obtain only a comparison at a certain moment in time, not a global one. To resolve this issue, we can use the logrank test or the hazard ratio.

2.4 The Logrank Test

The logrank test is a non-parametric test that uses the null hypothesis \(H_{0}\): “there is no difference between the two groups”. We perform this test by dividing the time scale according to observed events (i.e. deaths), while ignoring the censored data. For each interval we compute the observed number of deaths and the expected one, summing them up.

Suppose we have two groups of patients, each group receiving a certain chemotherapy drug. We divide the survival time into time periods, each period ending with one or multiple deaths. For each time period, and each patient group, we compute the number of patients that are at risk of death. Let us denote with \(r_{1}\) the number of patients at risk of death in sample group 1, and with \(r_{2}\) the number of patients at risk of death in sample group 2. Next, we compute the number of observed deaths for each group, \(f_{1}\) and \(f_{2}\). With this information we proceed to build the following table, Table 2.5.

Table 2.5 Death and survival data regarding the two sample groups of patients

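The expected number of deaths for each group is computed from the total number of deaths, \(f = f_{1} + f_{2}\), and the total number of patients at risk, \(r = r_{1} + r_{2}\), in that period: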

$$ e_{i} = \frac{{r_{i} \cdot f}}{r}, i = 1, 2. $$

Next, we must sum up the observed values, \(O_{i}\), as well as the expected values, \(E_{i}\):

$$ \begin{aligned} & O_{i} = \mathop \sum \limits_{j} f_{ji} , \quad i = 1,2, \\ & E_{i} = \mathop \sum \limits_{j} e_{ji} ,\quad i = 1,2. \\ \end{aligned} $$

The logrank statistic is computed as follows:

$$ X^{2} = \frac{{\left( {O_{1} - E_{1} } \right)^{2} }}{{E_{1} }} + \frac{{\left( {O_{2} - E_{2} } \right)^{2} }}{{E_{2} }}. $$

For multiple groups we will use:

$$ T = \mathop \sum \limits_{j = 1}^{n} \mathop \sum \limits_{i = 1}^{m} \frac{{\left( {O_{ij } - E_{ij} } \right)^{2} }}{{E_{ij} }}. $$

As a check on the computations, we use the control equality \(O_{1} + O_{2} = E_{1} + E_{2}\). To decide whether we accept or reject the null hypothesis, the statistic value is compared with a \(\chi^{2}\) distribution with \(n - 1\) degrees of freedom, where n represents the number of groups (one degree of freedom for two groups); m in the formula above represents the number of time intervals [12, 13].
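The steps of the two-group logrank computation can be sketched in Python as follows. The function name, the input format, and the toy data are our own assumptions, and the p-value is obtained under the assumption of one degree of freedom for two groups, using the identity that the upper tail of a \(\chi^{2}\) distribution with one degree of freedom equals erfc(√(x/2)).

```python
import math


def logrank_two_groups(group_1, group_2):
    """Each group is a list of (duration, died) pairs.

    Returns (O_1, E_1, O_2, E_2, X^2, p_value).
    """
    death_times = sorted({t for t, died in group_1 + group_2 if died})
    o1 = o2 = 0
    e1 = e2 = 0.0
    for t in death_times:
        r1 = sum(1 for d, _ in group_1 if d >= t)            # at risk in group 1
        r2 = sum(1 for d, _ in group_2 if d >= t)            # at risk in group 2
        f1 = sum(1 for d, died in group_1 if d == t and died)
        f2 = sum(1 for d, died in group_2 if d == t and died)
        r, f = r1 + r2, f1 + f2
        o1 += f1
        o2 += f2
        e1 += r1 * f / r                                     # e_1 = r_1 * f / r
        e2 += r2 * f / r                                     # e_2 = r_2 * f / r
    x2 = (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    p_value = math.erfc(math.sqrt(x2 / 2.0))                 # chi-square tail, 1 degree of freedom
    return o1, e1, o2, e2, x2, p_value


# Hypothetical samples (duration in months, died flag).
group_a = [(2, True), (3, True), (4, True), (5, True), (6, False)]
group_b = [(4, False), (7, True), (8, False), (9, True), (10, False)]

o1, e1, o2, e2, x2, p = logrank_two_groups(group_a, group_b)
print(f"O1={o1}, E1={e1:.3f}, O2={o2}, E2={e2:.3f}")
print(f"control equality holds: {abs((o1 + o2) - (e1 + e2)) < 1e-9}")
print(f"X^2={x2:.3f}, p={p:.4f}")
```

Censored patients are never counted as deaths, but they remain in the numbers at risk until the moment they leave the study.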

Let us exemplify how the logrank test works. Let us presume that we are conducting a clinical trial with two types of chemotherapy drugs, A and B. For 14 months we have monitored the two groups of patients diagnosed with stage IV lung cancer. The first group contains the 16 patients from the Kaplan–Meier section, whereas the second group contains 12 patients. Table 2.6 presents the data regarding the second sample group.

Table 2.6 Life table for lung cancer patients that underwent chemotherapy with drug B

Using the survival probability formula, we compute the survival probability at a given time. Table 2.7 presents these calculations.

Table 2.7 Survival probabilities, standard error and confidence interval

The corresponding Kaplan–Meier curve is plotted in Fig. 2.6.

Fig. 2.6
A graph plots cumulative survival versus time. The line for B fluctuates between 0.5 and 1 in survival with an increase in time. The survival function time is indicated above 10.

Kaplan–Meier survival curve second group

For a better comparison we shall plot both curves in the same plot. Figure 2.7 presents this plot.

Fig. 2.7
A graph plots cumulative survival versus time. The line for A fluctuates between 0 and 1 in survival with an increase in time between 2 and 12. The line for B fluctuates between 0.6 and 1 in survival with an increase in time between 0 and 14. The survival function time is marked between 9, 10 for A and 11, 14 for B.

Kaplan–Meier curves for both patient samples

Using the logrank equations we can build the following table, Table 2.8:

Table 2.8 Goodness of fit table

The Goodness of Fit histogram is presented in Fig. 2.8.

Fig. 2.8
A histogram plots values versus group 1 and group 2. The values observed and expected in group 1 are 12 and 7. The values observed and expected in group 2 are 4 and 9. All values are approximate.

Goodness of fit histogram

The test statistic \(\chi^{2}\) equals 6.829, while the p-value equals 0.0089. This means that we reject the null hypothesis, implying that there is a significant difference between the survival rates of the patients that use drug A and those that use drug B.
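The reported p-value can be reproduced from the test statistic with the Python standard library alone. The sketch assumes one degree of freedom (two groups), for which the upper tail of the \(\chi^{2}\) distribution equals erfc(√(x/2)); the function name is our own.

```python
import math


def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


statistic = 6.829                       # value reported in the text
p_value = chi2_upper_tail_1df(statistic)

print(f"p = {p_value:.5f}")
print("reject H0 at the 5% level" if p_value < 0.05 else "do not reject H0")
```

The result is approximately 0.009, in agreement with the reported p-value up to rounding.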

2.5 The Hazard Ratio

Besides knowing whether two groups of observations differ, we might be interested in seeing how large the difference between them truly is. For this purpose we cannot use the logrank test, but we can apply the hazard ratio. Technically, we will measure the relative survival between the two groups by comparing the observed and expected numbers [14,15,16,17,18,19]. The hazard ratio is computed using the following formula:

$$ R = \frac{O_{1} / E_{1}}{O_{2} / E_{2}}. $$

Returning to our example, we obtain the following results (Fig. 2.9; Table 2.9).

Fig. 2.9
A graph plots numbers without events versus time periods since the beginning. The line for drug A fluctuates between 0 and 16 in number with an increase in time periods between 0 and 14. The line for drug B fluctuates between 8 and 12 in number with an increase in time periods between 0 and 14.

Hazard ratio—Time to event curves

Table 2.9 Hazard ratio results

The obtained results show that the estimated risk of dying when undergoing chemotherapy with drug A is 3.1839 times the estimated risk of dying when undergoing chemotherapy with drug B. More specifically, if the hazard ratio equals 1, it means that there is no difference in survival rates / event rates over time between the two sample groups. If the hazard ratio is greater than 1, just like in our example, then the risk of having an event is greater in the group that uses drug A than in the group that uses drug B. Please note that the hazard ratio indicates an increase of hazard when using drug A, which is an increase in the rate of the event, not in the chances of it happening.
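The hazard ratio formula and its interpretation can be sketched in a few lines of Python. The observed and expected counts below are purely hypothetical and are not the values of Table 2.9; the function names are our own.

```python
def hazard_ratio(o1, e1, o2, e2):
    """R = (O_1 / E_1) / (O_2 / E_2)."""
    return (o1 / e1) / (o2 / e2)


def interpret(r, tol=1e-9):
    """Verbal reading of a hazard ratio of group 1 relative to group 2."""
    if abs(r - 1.0) <= tol:
        return "no difference in event rate between the groups"
    if r > 1.0:
        return "higher event rate in group 1"
    return "lower event rate in group 1"


# Hypothetical observed and expected numbers of deaths for two groups.
r = hazard_ratio(o1=12, e1=8.0, o2=4, e2=8.0)
print(f"R = {r:.2f}: {interpret(r)}")
```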

2.5.1 Cox Regression Model

The chapter ends with Cox’s proportional hazards regression model, or Cox regression, which creates a survival function that gives us the probability that a certain event (e.g. death, no response to treatment) happens at a particular time t. Having previously observed and recorded data, we can build the model, and afterwards use it to make predictions on new patients. Cox regression can analyze multiple factors. Some may think that instead of using the Cox regression, we might be able to use multiple linear regression. This is not possible due to the following:

  • In general, samples that contain survival times follow exponential or Weibull distributions, whereas multiple linear regression cannot be applied unless the sample data is governed by the Gaussian distribution.

  • Survival times contain censored data.

When using the Cox proportional hazards regression method, we need to compute the survival function and the hazard function. We compute the survival function as follows:

$$ S\left( t \right) = P\left\{ {T > t} \right\}, $$

where t is the time, and T is the random variable that denotes the time of the patient’s death. Hence, we can write the lifetime distribution as:

$$ F\left( t \right) = 1 - S\left( t \right). $$

We compute the number of deaths per time unit as: \(f\left( t \right) = \frac{d}{dt}F\left( t \right)\). The hazard function is:

$$ \lambda \left( t \right)dt = P\left\{ {t < T < t + dt \mid T > t} \right\} = \frac{f\left( t \right)dt}{{S\left( t \right)}} = - \frac{{S^{\prime}\left( t \right)dt}}{S\left( t \right)}. $$
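A numerical illustration may help here. The Python sketch below approximates the hazard as \(- S^{\prime}\left( t \right)/S\left( t \right)\) with a central finite difference, for a Weibull survival function \(S\left( t \right) = \exp \left( { - \left( {t/\text{scale}} \right)^{\text{shape}} } \right)\). The parametrization, the parameter values, and the function names are our own assumptions.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape); shape = 1 gives the exponential case."""
    return math.exp(-((t / scale) ** shape))


def hazard_numeric(survival, t, dt=1e-6):
    """Approximate -S'(t) / S(t) with a central finite difference."""
    derivative = (survival(t + dt) - survival(t - dt)) / (2.0 * dt)
    return -derivative / survival(t)


def exponential(t):
    return weibull_survival(t, shape=1.0, scale=10.0)


def weibull(t):
    return weibull_survival(t, shape=2.0, scale=10.0)


for t in (2.0, 5.0, 8.0):
    print(f"t={t}: exponential hazard={hazard_numeric(exponential, t):.4f}, "
          f"Weibull hazard={hazard_numeric(weibull, t):.4f}")
```

For the exponential case the hazard is the same at every t, whereas for a Weibull shape parameter greater than 1 it increases with time.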

Practically, by computing the hazard function we find the patient’s death risk within the timeframe dt, given that the patient has survived up to time t. The Cox regression model presumes that the variables within the hazard function are independent and have a constant effect over the survival time, and each of them can be a predictor or covariate:

$$ h\left( {t; Z_{1} , Z_{2} , \ldots , Z_{k} } \right) = h_{0} \left( t \right) \cdot \exp \left( {b_{1} Z_{1} + b_{2} Z_{2} + \cdots + b_{k} Z_{k} } \right). $$

The function can be afterwards transformed into:

$$ \ln \left[ {\frac{{h(t; Z_{1} , Z_{2} , \ldots , Z_{k}) }}{{h_{0} \left( t \right)}}} \right] = b_{1} Z_{1} + b_{2} Z_{2} + \cdots + b_{k} Z_{k} . $$

The function \(h_{0} \left( t \right)\) is the underlying (baseline) hazard function and represents the hazard when all the variables equal 0. Two assumptions must be fulfilled [20,21,22]:

  • The hazard and the independent variables have a log-linear relationship;

  • The hypothesis of proportionality: the covariates act multiplicatively on the underlying hazard function, so that the ratio between the hazards of two patients stays constant over time.
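The two assumptions can be illustrated with a short Python sketch of the model equation. The baseline hazard, the coefficients, and the covariate values below are hypothetical and were chosen only for illustration; they are not fitted values.

```python
import math


def relative_hazard(coefs, covariates):
    """exp(b_1 * Z_1 + ... + b_k * Z_k)."""
    return math.exp(sum(b * z for b, z in zip(coefs, covariates)))


def hazard(t, baseline, coefs, covariates):
    """h(t; Z) = h_0(t) * exp(b_1 * Z_1 + ... + b_k * Z_k)."""
    return baseline(t) * relative_hazard(coefs, covariates)


def baseline(t):
    """Hypothetical underlying hazard that increases with time."""
    return 0.02 * t


coefs = [0.01, 0.10, -0.34]        # hypothetical b_1, b_2, b_3
patient_a = [60, 2, 5]             # hypothetical covariate values Z_1, Z_2, Z_3
patient_b = [60, 0, 5]

for t in (1.0, 5.0, 10.0):
    ratio = hazard(t, baseline, coefs, patient_a) / hazard(t, baseline, coefs, patient_b)
    print(f"t={t}: hazard ratio A/B = {ratio:.4f}")
```

The baseline hazard cancels in the ratio, so the printed value is the same at every t and equals \(\exp \left( {b_{2} \cdot 2} \right)\) here, which is precisely the proportionality hypothesis.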

Let us see how the Cox regression works on another fictional example. Our sample data contains 17 patients diagnosed with lung cancer. The dataset contains four numeric attributes (time, age, number of affected lymph nodes, number of months that have passed since the surgery) and one categorical variable (survival). The data is presented in Table 2.10.

Table 2.10 Fictional lung cancer patient dataset

First, we were interested in plotting the Kaplan–Meier curve to see the survival after the oncological lung surgery. Figure 2.10 shows the curve together with the 95% confidence interval.

Fig. 2.10
A graph plots the timeline of the Kaplan–Meier estimate. The line of estimate fluctuates between 1.0 and 0.0 with an increase in the timeline. The fluctuated line is shaded.

Kaplan–Meier curve and 95% confidence interval for survival after oncological lung surgery

Next, we built two cohorts of patients. The first had no cancerous lymph node detected, the other had at least one. We have plotted the survival curve for both groups using Kaplan–Meier (Fig. 2.11).

Fig. 2.11
A graph plots the timeline of the Kaplan–Meier estimate. The lines for at least one positive axillary detected and no positive axillary nodes detected fluctuate between 1.0 and 0.0 with an increase in the timeline.

Kaplan–Meier curve and 95% confidence interval for two lung patient cohorts

We have applied the Cox regression model having the Survival attribute as event and the Time attribute as duration. The obtained results are in Table 2.11. The summary statistics table indicates the significance of the covariates in predicting the survival risk. The wide confidence intervals reflect the small size of the data sample. The p-value shows us that the number of months that have passed since the oncological surgery is significant, while the other covariates are not. The hazard ratio for this attribute is 0.71, showing a strong relationship between the number of months that have passed since the surgery and a decreased risk of death. Notice that the hazard ratio for Age is 1.01, which suggests only a 1% increase in hazard for each unit increase in age.

Table 2.11 Results of Cox hazard regression

Technically:

  • Hazard ratio = 1: no effect

  • Hazard ratio < 1: reduction in the hazard

  • Hazard ratio > 1: increase in hazard.
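These rules can be applied directly to the two hazard ratios quoted above (0.71 and 1.01). The Python sketch below converts each hazard ratio into its regression coefficient, \(b = \ln \left( {HR} \right)\), and into a percentage change in hazard per unit increase of the covariate; the function names are our own, and the reading "per unit increase" assumes the covariates entered the model as continuous variables.

```python
import math


def coefficient(hazard_ratio):
    """Regression coefficient b such that exp(b) equals the hazard ratio."""
    return math.log(hazard_ratio)


def percent_change(hazard_ratio):
    """Percentage change in hazard per unit increase of the covariate."""
    return (hazard_ratio - 1.0) * 100.0


def effect(hazard_ratio, tol=1e-9):
    if abs(hazard_ratio - 1.0) <= tol:
        return "no effect"
    return "increase in hazard" if hazard_ratio > 1.0 else "reduction in the hazard"


reported = {"months since surgery": 0.71, "age": 1.01}   # values quoted in the text
for name, hr in reported.items():
    print(f"{name}: HR={hr}, b={coefficient(hr):+.3f}, "
          f"change={percent_change(hr):+.0f}%, {effect(hr)}")
```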

Let us now see which attributes have the strongest effect, using the following plot (Fig. 2.12):

Fig. 2.12
A graph plots significant attributes versus log H R 95 percent C I. The line for age plots varies from negative 0.2 to 0.2. The line for nodes varies from negative 0.6 to 0.4. The line for surgery varies from negative 0.6 to negative 0.1. All values are approximate.

Significant attributes

From Fig. 2.12, we can clearly see that the number of months that have passed since the surgery is indeed significant, while the others are not. As a final note, we have plotted the survival probabilities for different persons in our dataset. From the graph (Fig. 2.13), we can see that patient 13 has the highest chances of survival, whereas patient 8 has the lowest.

Fig. 2.13
A graph plots survival probabilities for patients. The line for 13 depicts a decreasing trend between 0.2 and 1.0. The lines for 0, 4, and 8 fluctuate between 1.0 and 0.0 on the y-axis. Values are approximated.

Survival probabilities for patients: 0, 4, 8, and 13

2.6 Conclusions

This chapter provides a survival analysis guide with applications in oncology. Survival analysis represents an important part of cancer research. It can be applied to determine the survival rate of patients, to determine which treatment protocol is more efficient, or to establish whether new therapies are indeed better than the old ones. Using survival analysis in clinical trials we can move forward in providing the best care for cancer patients.

In this chapter we have discussed the theory behind Kaplan–Meier survival curves, the logrank test, the hazard ratio, and Cox regression, as well as practical examples. We hope that this chapter will provide data scientists and oncologists with a better understanding of the survival analysis process.