Abstract
Breast cancer is one of the most common cancers in women worldwide and causes an enormous number of deaths annually. However, early diagnosis of breast cancer can improve survival outcomes, enabling simpler and more cost-effective treatments. The recent increase in data availability provides unprecedented opportunities to apply data-driven and machine learning methods to identify early-detection prognostic factors capable of predicting patients' expected survival and potential sensitivity to treatment, with the final aim of enhancing clinical outcomes. This tutorial presents a protocol for applying machine learning models in survival analysis for both clinical and transcriptomic data. We show that integrating clinical and mRNA expression data is essential to explain the multiple biological processes driving cancer progression. Our results reveal that machine-learning-based models such as random survival forests, gradient boosted survival models, and survival support vector machines can outperform traditional statistical methods, i.e., the Cox proportional hazards model. The highest C-index among the machine learning models was obtained with the survival support vector machine, with a value of 0.688, whereas the C-index obtained with the Cox model was 0.677. SHapley Additive exPlanations (SHAP) values were also used to identify the feature importance of the models and their impact on the prediction outcomes.
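The C-index used above to compare the models is the fraction of comparable patient pairs that a model ranks in the correct order: the patient with the shorter observed survival should receive the higher predicted risk. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The following is a minimal pure-Python sketch of Harrell's C-index for right-censored data. It is not the chapter's implementation. The function name `concordance_index` and the toy data are illustrative assumptions, and pairs with tied survival times are simply skipped for brevity.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times       -- observed follow-up time for each patient
    events      -- True if the event was observed, False if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplifying assumption: ignore pairs with tied times
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk assigned to the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data (made up for illustration): five patients, two of them censored.
times = [5, 8, 12, 3, 9]
events = [True, False, True, True, False]
risk_scores = [0.9, 0.4, 0.2, 0.8, 0.5]

c_index = concordance_index(times, events, risk_scores)
print(round(c_index, 3))  # 6 of the 7 comparable pairs are concordant
```

In practice, a survival analysis library would normally supply this metric. The explicit loop is shown only to make clear how censored patients limit which pairs can be compared.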
Acknowledgements
AO and CA acknowledge the support of Earlier.org through their Research Grant “Application of computational models of breast cancer for early-detection personalised tests.” CA acknowledges the support of EPSRC and The Alan Turing Institute through their Turing Network Development Award, and the Children’s Liver Disease Foundation through their Research Grant.
© 2023 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Doan, L.M.T., Angione, C., Occhipinti, A. (2023). Machine Learning Methods for Survival Analysis with Clinical and Transcriptomics Data of Breast Cancer. In: Selvarajoo, K. (eds) Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology. Methods in Molecular Biology, vol 2553. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2617-7_16
DOI: https://doi.org/10.1007/978-1-0716-2617-7_16
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2616-0
Online ISBN: 978-1-0716-2617-7
eBook Packages: Springer Protocols