Abstract
Survival analysis has a crucial role in oncology. These statistical methods are ubiquitous in oncology, helping physicians determine the death risk, the best course of treatment, and even help in discovering new therapies. The aim of this chapter it to provide a guide on some survival analysis statistical methods as well as practical examples on how to apply them in oncology. We begin with some theory regarding survival analysis in general, followed by Kaplan–Meier survival curves, logrank test and hazard ration to determine the best course of treatment. The chapter ends with the Cox regression. This chapter is a step-by-step guide in performing survival analysis and in interpreting the obtained results.
Access provided by Autonomous University of Puebla. Download chapter PDF
Similar content being viewed by others
Keywords
2.1 Introduction
Cancer remains the main health challenge we are facing nowadays. Seems like every day you find out that another person you know has been diagnosed with cancer. After the shock wave, you feel sad for that person, and guilty because you are somehow happy that this did not happen to you or your loved ones. Even if heart diseases still remain the first cause of death worldwide, the death rate caused by them is rapidly falling. Why is that? Because we found out what causes heart diseases: high blood pressure, high weight, high glucose levels, smoking, drinking, etc. Knowing the cause, we have the means to prevent them. Unfortunately, this is not the case for cancer. We still do not know what causes cancer.
Sadly enough, nowadays most people Google first the symptoms, before speaking to a physician. No matter what we are Googling, cancer will appear on the result page. So, we are alarmed, and while waiting for the doctor’s appointment we start Googling for that type of cancer. The most searched questions are: what is the 5-year survival rate? What is the overall survival? If there is need for surgery, what is the morbidity? What about the mortality rate? So, the number one outcome of interest is survival—overall, disease-free, recurrent, surgery survival.
The Global cancer prevalence rose from 0.54 to 0.64% since 1990. For instance for prostate cancer the rates rose from 67.8% to 98.6%, due to better AI prediction, [1]. Even if the cancer rates are rising, the death rates are falling. This can only mean one thing: early diagnosis and/or better novel treatments, hence people have better and longer survival rates. The 5-year survival rates for all cancers have increased from 50.3% to 67% [2,3,4]. Table 2.1 and Fig. 2.1 present how the 5-year survival rates for different types of cancers have changed from 1970–1977 to 2007–2013 in the USA.
WHO’s global target (25 × 25) is a 25% reduction in deaths from cancer in people aged 30–69 years by 2025 [5]. Cancer survival research is crucial for developing cancer control strategies [6], control measures [7], so that the effectiveness and costs of them to be assessed [8, 9].
Survival analysis concerns the time until a certain event takes place: i.e. the time that passes from the start of chemotherapy until the tumor stops shrinking (the patient stops responding to treatment), the time elapsed from when a cancer surgery is over until the patients gets out of the ICU (Intensive Care Unit), or the time that passed from the moment the patient started radiotherapy, till she/he passed away.
In this chapter, we are going to present statistical models that are used in survival analysis, along with examples and explanations regarding the obtained results.
2.2 Survival Analysis
Survival analysis deals with survival times. In order for you to start such an analysis you need two variables: a numeric metric of time (i.e. number of days, weeks, or months), and a categorical variable that identifies the event (i.e. irresponsive to treatment, death). The merged variables give us a lot of information about whether a subject has entered or left the study, and if and when the subject has met a certain criterion or not.
Unfortunately, in practice things are never that simple, and we might find ourselves in two situations:
-
The start time cannot be specified. For example, we cannot establish exactly when was the exact onset of the disease. Some cancers progress at a faster pace than others, being more aggressive, but there is no precise method to determine the debut of cancer.
-
The end time is difficult to be determined. If the end time is given by the time of death, then there is no problem in establishing it, but if a person decides to leave the study, or she/he survives more than the time that was set to be recorded (i.e. 5-year survival rate), things change. This is the case of the censored survival time.
In Fig. 2.2 we present an example of how we can record the survival times. The figure is converted into a table, Table 2.2.
The dotted line from Fig. 2.2 gives us information about the study: the patients were recruited the first 5 months of the study. The timeline from month 5 till month 9 represents the follow-up part of the study. The black square signifies that the patient has died, whereas the grey circle means that the patient did not die. To be more specific: patient 1 was recruited at the beginning of the study and stayed in the study till month 3, time of which she/he died. The second patient enrolled in the study from month 0, and did not have any event (i.e. died) during the whole observation period, so she/he represents a censored data. Patient 3 also started from the beginning of the study, but died five months later. The fourth patient enrolled on month 1 and stayed until the end of it. The fifth patient enrolled on month 1, and had an event (i.e. death) on month 8. The sixth patient enrolled on month 3, but left the study on month 5. Patient 7 joined the study on month 4 and passed away on month 6. Both patients 8 and 9 joined on month 5, the first leaving the study on month 8 (censored data), while the latter on month 9 (censored).
You can see in Table 2.2 that the censored data are marked with (*).
In general, we use survival analysis in oncology to review the outcomes of clinical trials, cohort studies, etc. For instance, if we have a cohort of 30 patients who have been diagnosed with lung cancer between 2017 and 2019, they have started chemotherapy and/or immunotherapy and were observed until the end of 2021, we wish to review their survival time. So far, we have seen that survival analysis contains a starting period, in which the patients are enrolled, followed by the observation period or follow-up, when the patients are observed. Besides these two stages, there exists another one named the final period. In this stage the collected data is analyzed, and conclusions are drawn.
Let us presume that from the 30 lung cancer patients, 6 of them left the study during the observation period. The 6 patients will be excluded from the statistical analysis process. From the remaining 24 patients, 14 survived and 10 passed away. This observation is depicted in a tree structure diagram, as the one presented in Fig. 2.3.
A more thorough tree diagram will include even the patients that left the study. See Fig. 2.4.
We can compute the death rate or the death risk using the following formula:
In our example the death rate is 0.41, that is 41%.
The death probability is computed as:
where N is the cohort size, D is the number of deaths, and L is the number of patients that left the study during the observation period. The death probability in our case is 0.37, that is 37%.
Having computed the death risk, we can compute the survival probability as 1—death probability for that interval. By plotting the cumulative survival probability, we obtain the survival curve. The curve starts at 1 meaning that all patients are alive and approaches 0 as patients start to die. In the following sections we shall discuss more about survival curves, starting with the Kaplan–Meier survival curves.
2.3 Kaplan–Meier Survival Curve
Kaplan–Meier curves were invented in 1958 by Edward L. Kaplan and Paul Meier, and they can be used if the data is incomplete [10]. They represent the standard for reporting the survival rate of patients, being used in over 70% of the oncology papers [11].
Kaplan–Meier curves use three types of data regarding the patient: the date the patient entered the study, the last date of observation (i.e. the last time the patient was seen alive), and whether the last observation was due to the death of the patient, or because the patient left the study. We can use Kaplan–Meier curves to determine the survival probability of a patient given certain conditions. For example, by recording the survival times of patients that undergo chemotherapy, we can compute the probability of a new patient to survive a certain period of time if she/he undergoes the same protocol.
We denote the survival time with a random variable X. \(P_{n}\) is the probability of a patient to survive the nth day after the last chemotherapy session, conditioned by the fact that she/he survived all the other \(n - 1\) days before that. \(\overline{{P_{n} }}\) is the total probability of surviving all the n days, and we compute it as follows:
We compute the intermediate survival probabilities using:
where \(p_{k}\) is the probability of surviving k units of time, \(r_{k}\) is the number of patients with a death risk at the k moment, that survived k units of time, and \(f_{k}\) the number of deaths reported at the k moment. If no patient has died the survival rate is 100%. We compute the standard error of the probability of surviving using:
If we presume that \(p_{k}\) is governed by the Normal distribution, than we can compute the 95% interval as follows:
The standard error does not always give accurate approximations if there are extreme values in the data sample. If this is the case, the Greenwood formula is preferred:
Let us presume that we have a sample data that contains 16 patients that have been diagnosed with stage IV lung cancer. All the patients have undergone chemotherapy treatment with a certain type of drug, drug A. The patients are monitored 14 months. We start the survival analysis from day 0 (Table 2.3).
Using the survival probability formula, we compute the survival probability at a given time. Table 2.4 presents these calculations.
The corresponding Kaplan–Meier curve is plotted in Fig. 2.5.
Kaplan–Meier survival curves have a drawback: if we wish to compare two or multiple sample data, we can obtain only a comparison at a certain moment in time, not a global one. To resolve this issue, we can use the logrank test, or hazard ratio.
2.4 The Logrank Test
The logrank test is a non-parametric test that uses the null hypothesis \(H_{0}\): “there is no difference between the two groups”. We perform this test by dividing the time scale according to observed events (i.e. deaths), while ignoring the censored data. For each interval we compute the observed number of deaths and the expected one, summing them up.
If we have two groups of patients, each group receives a certain chemotherapy drug. We will divide the survival time in time periods. Each period ends with one or multiple deaths. For each death unit, and each patient group, we compute the number of patients that are at death risk. Let us denote with \(r_{1}\) the number of patients with death risk for sample group 1, and with \(r_{2}\) the number of patients with death risk for sample group 2. Next, we compute the number of observed deaths for each group, \(f_{1}\) and \(f_{2}\). With this information we proceed on building the following table, Table 2.5.
The expected number of deaths for each group is computed using the following formula:
Next, we must sum up the observed values, \(O_{i}\), as well as the expected values, \(E_{i}\):
The logrank statistics is computed as follows:
For multiple groups we will use:
To verify whether we accept or reject the null hypothesis, we will use the \(O_{1} + O_{2} = E_{1} + E_{2}\) as control equality. The statistic value is compared with a \(\chi^{2}\) distribution with \(\left( {n - 1} \right)\left( {m - 1} \right) \) degrees of freedom. n represents the number of groups, whereas m represents the number of time intervals [12, 13].
Let us exemplify how the logrank test works. Let us presume that we are conducting a clinical trial, with two types of immunotherapy drugs, A and B. For 14 months we have monitored the two groups of patients diagnosed with IV grade lung cancer. The first group contains the 16 patients from the Kaplan–Meier section, whereas the second group contains 12 patients. Table 2.6 presents the data regarding the two sample groups.
Using the survival probability formula, we compute the survival probability at a given time. Table 2.7 presents these calculations.
The corresponding Kaplan–Meier curve is plotted in Fig. 2.6.
For a better comparison we shall plot both curves in the same plot. Figure 2.7 present this plot.
Using the logrank equations we can build the following table, Table 2.8:
The Goodness of Fit histogram is presented in Fig. 2.8.
The test statistic \(\chi^{2}\) equals 6.829, while the p-level equals 0.0089. This means that we will reject the null hypothesis, implying that there are significant differences between the survival rates of the patients that use drug A, versus drug B.
2.5 The Hazard Ratio
Besides the information regarding the difference between two groups of observations, we might be interested in seeing how truly different the two groups are. In this matter, we cannot use the logrank test, but we can apply the hazard ratio. Technically, we will measure the relative survival between the two groups by comparing the observed and expected numbers [14,15,16,17,18,19]. The hazard ratio is computed using the following formula:
Returning to our example, we obtain the following results (Fig. 2.9; Table 2.9).
The obtained results show that the estimated relative risk of dying when undergoing imunotherapy with drug A is 3.1839 of the estimated relative risk of dying when undergoing chemotherapy with drug B. More specifically, if the hazard ratio equals 1, it means that there is no difference in survival rates / event rate over time between the two sample groups. If the hazard ratio is greater than 1, just like in our example, then the risk of having an event is greater in the group that uses drug A versus the group that uses drug B. Please note that the hazard ratio indicates an increase of hazard when using drug A, which is an increase in the rate of the event, not the chances of it happening.
2.5.1 Cox Regression Model
The chapter ends with the Cox’s proportional hazard regression model or Cox regression, which creates a survival function that gives us a certain event’s probability (i.e. death, irresponsive to treatment) to happen at a particular time t. Having previously observed and recorded data, we can build the model, and afterwards use it to make predictions on new patients. Cox regression can analyze multiple factors. Some may think that instead of using the Cox regression, we might be able to use the multilinear regression. This is not possible due to the following:
-
in general, samples that contain survival times have exponential or Weibull distributions, and multiple linear regression cannot be applied unless the sample data is governed by the Gaussian distribution.
-
Survival times contain censored data.
When using the Cox proportional hazard regression method, we need to compute the survival function and the hazard function. We compute the survival function as it follows:
where t is the time, and T is the time remaining till the patient’s death. Hence, we can write the lifetime distribution as:
We compute the number of deaths per time unit as: \(f\left( t \right) = \frac{d}{dt}F\left( t \right)\). The hazard function is:
Practically, by computing the hazard function we find the patient’s death risk within the timeframe dt, when previously given T time left to live. The Cox regression model presumes that variables within the hazard function are independent and have a constant effect over the time of the survival, and each of them can be a predictor or covariance:
The function can be afterwards transformed into:
The \(h_{0} \left( t \right)\) is the underlying hazard function and represents the hazard when all the variables equal 0. Two assumptions must be fulfilled [20,21,22]:
-
The hazard and the independent variables have a log-linear relationship;
-
The hypothesis of proportionality: the relationship between the underlying hazard function and log-linear function of covariates exists.
Let us see how the Cox regression works on another fictional example. Our sample data contains 17 patients diagnosed with lung cancer. The dataset contains four attributes, four predictor variables (time, age, number of affected lymph nodes, number of months that have passed since the surgery), and the categorical variable (survival). The data is presented in Table 2.10.
First, we were interested in plotting the Kaplan–Meier curve to see the survival after the oncological lung surgery. Figure 2.10 show the curve together with 95% confidence interval.
Next, we built two cohorts of patients. The first had no cancerous lymph node detected, the other had more than one. We have plotted the survival curve for both groups using Kaplan–Meier (Fig. 2.11).
We have applied the Cox regression model having as event the Survival attribute, and as duration the Time attribute. The obtained results are in Table 2.11. The summary statistic table indicates the significance of the covariates in predicting the Survival risk. The large confidence interval indicates that the sample data is small. The p-level shows us that the number of months that have passed since the oncological surgery is significant, while the others are not. The hazard ratio for this attribute is 0.71 showing a strong relationship between the number of months that have passed since the surgery and decreased risk of death. Notice that the hazard ration for Age is 1.01, which suggests only a 1% increase for the higher age group. Technically:
-
Hazard ratio = 1: no effect
-
Hazard ratio < 1: reduction in the hazard
-
Hazard ratio > 1: increase in hazard.
Let us see now which attributes affect the most from the following plot (Fig. 2.12):
From Fig. 2.12, we can clearly see that the number of months that have passed since the surgery is indeed significant, while the others are not. As a final note, we have plotted the survival probabilities for different persons in our dataset. From the graph (Fig. 2.13), we can see that patient 13 has the highest chances of survival, whereas patient 8 has the lowest.
2.6 Conclusions
This chapter provides a survival analysis guide with applications in oncology. Survival analysis represents an important part of cancer research. It can be applied to determine the survival rate of patients, to determine which treatment protocol is more efficient, or to establish whether new therapies are indeed better than the old ones. Using survival analysis in clinical trials we can move forward in providing the best care for cancer patients.
In this chapter we have discussed the theory behind Kaplan–Meier survival curves, logrank test, hazard ration, and Cox regression, as well as practical examples. We hope that this chapter will provide data scientists and oncologists a better understanding of the survival analysis process.
References
Wulczyn, E., et al.: Predicting prostate cancer specific mortality with artificial intelligence-based Gleason Grading. Commun. Med. 1(10). https://doi.org/10.1038/s43856021-00005-3
Jemal, A., Ward, E.M., Johnson, C.J., Cronin, K.A., Ma, J., Ryerson, A.B., Mariotto, A., Lake, A.J., Wilson, R., Sherman, R.L., Anderson, R.N., Hensley, S.J., Kohler, B.A., Penberthy, L., Feuer, E.J., Weir, H.K.: Annual report to the nation of the status of cancer, 1975–2014, feature survival, JNCI: J. Natl. Cancer Inst. 109, 9 (2017). https://doi.org/10.1093/jnci/djx030
Allemani, C., Weirm H.K., Carreira, H., Harewood, R., Spika, D., Wang, X.S., Bannon, F., Ahn, J.V., Johnson, C.J., Bonaventure, A., Gragera, R.M., Stiller, C., Azevedo e Silva, G., Chen, W.Q., Ogunbiyi, O., Rachet, B., Soeberg, M., You, H., Coleman, M.P.: Global surveillance of cancer survival 1995–2009: analysis of individual data for 25 676 887 patients from 279 population-based registries in 67 countries (CONCORD-2). Lancet 385(9972), 977–1010 (2015)
WHO (2012). Decisions and list of resolutions of the 65th World Health Assembly: prevention and control of noncommunicable diseases: follow-up to the high-level meeting of the United Nations General Assembly on the prevention and control of non-communicable diseases (A65/DIV/3). World Health Organization, Geneva
Coleman, M.P., Forman, D., Byrant, H., et al., The ICBP Module 1 Working Group: Cancer survival in Australia, Canada, Denmark, Norway, Sweden, and the UK, 1995–2007 (the International Cancer Benchmarking Partnership): an analysis of population-based cancer registry data. Lancet 377, 127–138 (2011)
Department of Health: Improving Outcomes: a Strategy for Cancer. Department of Health, London (2011)
Rachet, B., Maringe, C., Nur, U., et al.: Population-based cancer survival trends in England and Wales up to 2007: an assessment of the NHS cancer plan for England. Lancet Oncol. 10, 351–369 (2009)
Rachet, B., Ellis, L., Maringe, C., et al.: Socioeconomic inequalities in cancer survival in England after the NHS Cancer Plan. Br. J. Cancer 103, 446–453 (2010)
Kaplan, E.L., Meier, P.: Nonparametric estimations for incomplete observations. J. Am. Stat. Assoc. 53(282), 457–481 (1958)
Stalpers, L.J., Kaplan, E.L.: Edward L. Kaplan and the Kaplan-Meier survival curve. BSHM Bull. 33, 109–135 (2018)
Bland, J.M., Altman, D.G.: The logrank rest. BMJ 328(74447), 1073 (2004)
Gorunescu, F., Belciug, S.: Incursiune in Biostatistica, Editura Albastra – Grupul Microinformatica (2014)
Altman, D.G.: Practical Statistic for Medical Research. Chapman and Hall (1991)
Belciug, S.: Artificial Intelligence in Cancer: Diagnostic to Tailored Treatment. Elsevier (2020)
Georgiev, G.Z.: One-tailed versus two-tailed tests of significance in A/B testing, https://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/ (2017)
Stare, J., Maucort-Boulch, D.: Odds ratio, hazard ratio and relative risk. Metodoloskizvezki 13(1), 59–67 (2016)
Sashegyi, A., Ferry, D.: On the interpretation of the hazard ratio and communication of survival benefit. Oncologist 22(4), 484–486 (2017)
Spruance, S.L., Reid, J.E., Grace, M., Samore, M.: Hazard ratio in clinical trials. Antimicrob. Agents Chemother. 48(8), 2787–2792 (2004)
Cox, D.R.: Regression models and life tables. J. Roy. Stat. Soc. B 34, 187–202 (1972)
Andersen, P., Gill, R.: Cox’s regression model for counting processes, a large sample study. Ann. Stat. 10, 1100–1120 (1982)
Harrell, F.E., Jr.: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer (2001)
Acknowledgements
The author would like to thank the reviewers for taking their time to review this chapter, and for their important comments that led to a better version of this work.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Belciug, S. (2023). A Survival Analysis Guide in Oncology. In: Lim, C.P., Vaidya, A., Chen, YW., Jain, V., Jain, L.C. (eds) Artificial Intelligence and Machine Learning for Healthcare. Intelligent Systems Reference Library, vol 229. Springer, Cham. https://doi.org/10.1007/978-3-031-11170-9_2
Download citation
DOI: https://doi.org/10.1007/978-3-031-11170-9_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11169-3
Online ISBN: 978-3-031-11170-9
eBook Packages: EngineeringEngineering (R0)