Machine Learning Methods for Survival Analysis with Clinical and Transcriptomics Data of Breast Cancer

  • Protocol
Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology

Part of the book series: Methods in Molecular Biology (MIMB, volume 2553)

Abstract

Breast cancer is one of the most common cancers in women worldwide and causes an enormous number of deaths annually. However, early diagnosis of breast cancer can improve survival outcomes, enabling simpler and more cost-effective treatments. The recent increase in data availability provides unprecedented opportunities to apply data-driven and machine learning methods to identify early-detection prognostic factors capable of predicting patients' expected survival and potential sensitivity to treatment, with the final aim of enhancing clinical outcomes. This tutorial presents a protocol for applying machine learning models in survival analysis to both clinical and transcriptomic data. We show that integrating clinical and mRNA expression data is essential to explain the multiple biological processes driving cancer progression. Our results reveal that machine-learning-based models such as random survival forests, gradient-boosted survival models, and survival support vector machines can outperform traditional statistical methods, i.e., the Cox proportional hazards model. The highest C-index among the machine learning models was recorded for the survival support vector machine, with a value of 0.688, whereas the C-index recorded for the Cox model was 0.677. Shapley Additive Explanation (SHAP) values were also applied to identify the feature importance of the models and their impact on the prediction outcomes.
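The models in the abstract are compared by their C-index (concordance index). As a rough illustration of what that number measures, the sketch below computes Harrell's C-index in plain Python: among patient pairs whose ordering can be determined despite censoring, it counts how often the patient with the higher predicted risk experiences the event first. This is not the chapter's implementation; the function name `concordance_index` and the toy data are hypothetical, and pairs with tied observed times are simply skipped as a simplifying assumption.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index (illustrative sketch; tied observed times are skipped).

    times       -- observed follow-up time per patient
    events      -- True if the event was observed, False if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the second one is censored.
times = [5, 10, 12, 20]
events = [True, False, True, True]
risk = [0.9, 0.4, 0.1, 0.6]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the reported values of 0.688 and 0.677 are read as a modest improvement over chance, with a small gap between the two models.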

Acknowledgements

AO and CA acknowledge the support of Earlier.org through their Research Grant “Application of computational models of breast cancer for early-detection personalised tests.” CA acknowledges the support of EPSRC and The Alan Turing Institute through their Turing Network Development Award, and the Children’s Liver Disease Foundation through their Research Grant.

Corresponding author

Correspondence to Annalisa Occhipinti.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Doan, L.M.T., Angione, C., Occhipinti, A. (2023). Machine Learning Methods for Survival Analysis with Clinical and Transcriptomics Data of Breast Cancer. In: Selvarajoo, K. (eds) Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology. Methods in Molecular Biology, vol 2553. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2617-7_16

  • DOI: https://doi.org/10.1007/978-1-0716-2617-7_16

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-2616-0

  • Online ISBN: 978-1-0716-2617-7

  • eBook Packages: Springer Protocols
