Abstract
Objective
Breast cancer (BC) is a multifactorial disease and one of the most common cancers worldwide. This study compared machine learning (ML) techniques in order to develop a comprehensive BC risk prediction model based on features drawn from several categories of risk factors.
Methods
The study sample contained 810 records (115 cancer patients and 695 healthy individuals). Based on expert opinion, 45 of 85 candidate attributes were selected. The selected attributes cover genetic, biochemical, biomarker, gender, demographic, and pathological factors. Thirteen ML models were trained on these attributes, and the attribute coefficients and their internal relationships were calculated.
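The sample described above is imbalanced, so it can help to work out the class proportions before reading any accuracy figure. The following is a minimal sketch that uses only the counts stated in the Methods; the variable names are illustrative, and the "always predict healthy" baseline is an assumption added here for comparison, not something the study reports.

```python
# Counts taken from the Methods paragraph.
n_cancer = 115
n_healthy = 695
n_total = n_cancer + n_healthy  # 810 records in total

# Attribute selection: 45 of 85 candidate attributes were kept.
n_selected, n_candidates = 45, 85

# Share of the positive (cancer) class in the sample.
prevalence = n_cancer / n_total

# Accuracy of a trivial classifier that always predicts "healthy".
# This baseline is an illustrative assumption, not a result from the study.
majority_baseline_accuracy = n_healthy / n_total

print(f"records: {n_total}")
print(f"cancer prevalence: {prevalence:.1%}")
print(f"attributes kept: {n_selected / n_candidates:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

With roughly 14% positive cases, a model that never predicts cancer would already reach about 86% accuracy, which is why precision and AUC are worth reading alongside accuracy.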
Result
Random forest (RF) outperformed the other methods (accuracy 99.26%, precision 99%, and area under the curve (AUC) 99%). Assessing the impact and correlation of variables with the RF method based on principal component analysis (PCA) indicated that pathology, biomarker, biochemistry, gene, and demographic factors affected the risk of BC with coefficients of 0.35, 0.23, 0.15, 0.14, and 0.13, respectively (r² = 0.54).
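The reported figures can be checked with simple arithmetic. The sketch below ranks the five factor coefficients and confirms that they sum to 1, then shows how accuracy and precision follow from a confusion matrix. The confusion-matrix counts are hypothetical: they were chosen only to be consistent with 99.26% accuracy over all 810 records, and the abstract does not say how the data were split for evaluation or how precision was averaged.

```python
# Factor coefficients as reported in the abstract.
coefficients = {
    "pathology": 0.35,
    "biomarker": 0.23,
    "biochemistry": 0.15,
    "gene": 0.14,
    "demographic": 0.13,
}

total = sum(coefficients.values())
ranked = sorted(coefficients.items(), key=lambda kv: kv[1], reverse=True)


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp)


# HYPOTHETICAL counts, not taken from the paper. They assume evaluation
# on all 810 records (115 cancer, 695 healthy) with 6 errors in total.
tp, fn = 110, 5   # cancer patients: correctly flagged / missed
tn, fp = 694, 1   # healthy individuals: correctly cleared / falsely flagged

print(f"sum of coefficients: {total:.2f}")
for name, value in ranked:
    print(f"{name:<13}{value:.2f}")
print(f"accuracy:  {accuracy(tp, fp, tn, fn):.2%}")
print(f"precision: {precision(tp, fp):.2%}")
```

Under these assumed counts, 804 of 810 correct predictions gives the reported 99.26% accuracy; other splits of the six errors would give the same accuracy with different precision.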
Conclusion
Breast cancer has several risk factors, which medical experts use for early diagnosis. Identifying the relevant risk factors and their effects can therefore increase diagnostic accuracy. Considering a broad set of features supports the development of a comprehensive prediction model. In this study, a breast cancer prediction model with 99.3% accuracy was developed from multifactorial features using the RF technique.
Funding
This study was funded by Mashhad University of Medical Sciences (grant numbers 960336, 960275, 960211, 951122, 940724, and 961731).
Author information
Contributions
Study conception and design: EN, MT, AA, MK, GAF, HT. Acquisition of data: EN. Analysis and interpretation of data: EN, HN, MT, RA, AHF. Drafting of manuscript: EN, HN, MD, AM. Critical revision: AA, GAF.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of Mashhad University of Medical Sciences and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nazari, E., Naderi, H., Tabadkani, M. et al. Breast cancer prediction using different machine learning methods applying multi factors. J Cancer Res Clin Oncol 149, 17133–17146 (2023). https://doi.org/10.1007/s00432-023-05388-5
DOI: https://doi.org/10.1007/s00432-023-05388-5