Abstract
Cancer remains one of the most devastating diseases affecting mankind. Many bioinformatics investigators have applied Artificial Intelligence (AI)-based learning approaches to develop computationally efficient models for detecting cancerous conditions. Gene expression analysis has shown significant promise in predicting outcomes for several kinds of cancer. Nonetheless, limited sample sizes continue to be a barrier to developing strong and effective classifiers. Traditional supervised learning approaches are limited to labeled data. As a result, a substantial proportion of microarray data sets that lack appropriate follow-up information are ignored. The ability of AI-based deep learning strategies to extract noteworthy features from intricate datasets underlines their significance. Artificial intelligence and machine learning techniques are making inroads into biological research and health care, including, crucially, cancer research, where their practical implications are immense. These include cancer detection and diagnosis, subtype categorization, cancer therapy optimization, and the identification of novel therapeutic pathways in pharmaceutical research. While the massive data required to train machine learning models may already exist, realizing the full potential of artificial intelligence in both cancer research and therapy would require significant hurdles to be overcome. The growing requirement is to apply artificial intelligence while maintaining standards, in order to revolutionize the diagnosis, prognosis, and treatment of cancer patients and to drive biomedical research. In the present study, we review the AI-based approaches employed in the most recent publications on cancer prognosis.
Introduction
Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Such growths are malignant tumors, in contrast to benign tumors, which do not spread to other parts of the body [1, 2]. Hence, malignant cells are harmful, while benign cells are not categorized as harmful. The growing number of cancer cases points to the need for early analysis and prediction of cancer, which requires research studies. The classification of patients into cancerous or healthy classes has attracted the interest of many bioinformatics investigators. Distinct practices have been used on a large scale in cancer investigation studies to advance predictive models, resulting in efficient and precise decision-making. These models serve as effective decision support systems in the medical arena. Artificial Intelligence is a broader term that encompasses Machine Learning (ML) and Deep Learning (DL). ML is a collection of methods within the broad class of predictive analytics [3]; these approaches use several numerical, probabilistic, and optimization techniques that let machines learn and detect patterns from massive, noisy, or complex datasets. The application of ML techniques in the medical arena, particularly to detailed proteomic and genomic information, should be encouraged for cancer prognosis [4, 5].
This study emphasizes the application of AI-based prediction models used for predicting the early diagnosis, risk assessment, life expectancy, and probability of cancer recurrence. The motivation behind cancer studies is discussed in the next section, followed by the literature survey and discussion.
Motivation
Global cancer statistics provide the motivation behind cancer research. The whole world is dealing with the burden of cancer deaths. This section presents records of cancer deaths in the world, followed by cancer death statistics for India.
(i) Cancer deaths in the world
As stated in a WHO report [6], around 9.6 million people died of cancer in 2018 alone, and many new cases of cancer are diagnosed regularly. Lung cancer is the most common cancer globally and is responsible for 22% of total cancer deaths in the world. Deaths caused by cancer are a significant concern worldwide, and hence this issue needs urgent attention. Approximately 19.3 million new cases and 10 million deaths were recorded in 2020 alone. Moreover, in 2020, more than 1 million incident cases of stomach cancer occurred worldwide. Figure 1 demonstrates the top five cancers that are responsible for the majority of cancer deaths.
(ii) Cancer deaths in India
As per the August 2019 report update by the National Cancer Registry Programme [7], cancer accounts for around 0.7 million deaths in India. The rise in deaths of ~41% in just 8 years (from 2010 to 2018) is enough to provoke the thought that we need to find a solution to this problem as early as possible [8]. The statistics of 2018 are depicted in Table 1.
Males account for more than 52% of these cancer deaths and females for about 47%, which indicates that more males than females have been affected by cancer. The top five cancers responsible for most cancer deaths in India are listed in Table 2.
In men, the leading cancers are oral cavity and lung cancers, which account for one-fourth of all cancer deaths, followed by stomach and other cancers. In females, breast and oral cavity cancers alone account for one-fourth of all cancer deaths. Cancer has become a global issue that needs to be resolved lest we lose generations of people to this deadly disease [6].
Various new research studies [5, 9, 10] have been conducted in the field of cancer diagnosis, and several studies [4, 11] have contributed to the analysis of survival time and risk assessment of cancer. Investigators must therefore find efficient techniques that can detect cancer at an early stage.
Significance and Contribution of Study
The prediction of prognosis at the moment of tumor identification is a crucial problem in clinical cancer research. Accurate outcome prediction can aid in the selection of the best treatment for each patient. For instance, if patients can be precisely assigned to subgroups based on whether the disease would relapse within a certain amount of time after tumor resection, adjuvant chemotherapy (CTX) could be given to patients with higher risk, while patients with stable conditions may be spared this toxic treatment. Microarray-based gene expression profiling has shown significant promise in predicting outcomes for many kinds of cancer.
This study contributes to the field of cancer research by providing an overview of the current literature. It also highlights the limitations of previous studies, thereby paving the way for improved research. Our study highlights the importance of different AI techniques, such as decision trees, support vector classifiers, and neural techniques, owing to their efficient performance in earlier studies.
Organization of Paper
The manuscript is organized for ease of reading. “Background Study” presents the background that explains the importance of Artificial Intelligence-based techniques. “Artificial Intelligence for Cancer Detection” provides insights into the significant achievements of AI in cancer detection and outlines prediction modeling in an AI framework. “Research Analysis” presents a research analysis of the current literature. “Discussion” discusses the findings of the study. Lastly, the article is concluded in the “Conclusion”.
Background Study
This section gives an overview of ML techniques and analyzes how ML approaches are applied in the latest research studies.
AI-based techniques have often been utilized to diagnose and detect cancer [7,8,9]. AI-based models are used to determine whether a person showing the symptoms of a specific cancer is suffering from it or not.
AI-Based Learning Techniques
Machine learning techniques are classified into three broad categories. Figure 2 shows the categorization of machine learning approaches.
(i) Supervised learning: This learning involves a known set of input data and known responses to the data, and the model is trained to produce reasonable predictions in response to new data. The supervised learning approach includes decision tree techniques (CART) [12], Bayesian methods (Naïve Bayes [13] and variations thereof, Bayesian Model Averaging), Artificial Neural Networks [14], Instance-Based Learning (K Nearest Neighbors), and Ensemble Methods [15] (Boosting, Bagging, Adaboost, Gradient Boosting Machines, Gradient Boosted Regression Trees, and Random Forest).
(ii) Unsupervised learning: This class of learning generates a descriptive model. In this class, there is no target to learn. The prime aim of this approach is to explore the data and find some patterns that can help analyze the data instances. Unsupervised learning includes Clustering Methods [16] (K Means and Hierarchical Clustering) and Principal Components [17].
(iii) Semi-supervised learning: Current studies suggest that combining unlabeled data with partially labeled data improves the prediction results, i.e., yields better accuracy [18]. This group of learning, which considers both labeled and unlabeled data, is referred to as Semi-supervised Learning. It is employed in the same areas as Supervised Learning but is particularly convenient when the labeling cost is too high to permit a fully labeled training procedure. Semi-supervised Learning methods include regression algorithms, such as Ordinary Least Squares Regression, Linear Regression, Logistic Regression [19], and Stepwise Regression. The classification of machine learning into three sub-types is shown in Fig. 2.
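To illustrate how labeled and unlabeled samples can be combined, the following is a minimal sketch of self-training, one simple semi-supervised scheme. It is an assumption chosen for illustration and is not necessarily the method used in the cited studies. The nearest-centroid classifier, the one-dimensional features, the `margin` threshold, and all function names are hypothetical.

```python
def centroids(points, labels):
    """Mean feature value per class (nearest-centroid model, 1-D features)."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def self_train(labeled_x, labeled_y, unlabeled_x, margin=1.0, max_rounds=10):
    """Iteratively pseudo-label unlabeled points that are classified confidently.

    A point counts as confident when its nearest class centroid is closer
    than the runner-up by at least `margin` (an illustrative criterion).
    Returns the final centroids and the points that stayed unlabeled.
    """
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        cents = centroids(xs, ys)  # refit on labeled plus pseudo-labeled data
        remaining, added = [], False
        for x in pool:
            dists = sorted((abs(x - c), y) for y, c in cents.items())
            if len(dists) > 1 and dists[1][0] - dists[0][0] >= margin:
                xs.append(x)
                ys.append(dists[0][1])  # adopt the predicted label
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added or not pool:
            break
    return centroids(xs, ys), pool


# Hypothetical toy data: two labeled samples and three unlabeled ones.
model, still_unlabeled = self_train([0.0, 10.0], [0, 1], [1.0, 9.0, 5.0])
print(model, still_unlabeled)
```

In this toy run the two unlabeled points near the class centers receive pseudo-labels, while the ambiguous midpoint is left unlabeled rather than forced into a class.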
Data Gathering
The data gathering step is the most crucial, as it determines the quality and the reliability of the classification model. Figure 3 shows the variety of data used in cancer diagnosis.
The data used for diagnosing cancer are often gathered from information available within the hospital. Cancers are often diagnosed using image processing tools that extract data from medical images such as magnetic resonance imaging (MRI) and computerized tomography (CT) scans [3], together with genetic expressions (mutations in genes) or microarray analysis [20], clinical data, demographic features [21], expert notes, and other electronic health records [22]. The information collected is often used as the decisive parameter to model the classifier for determining the result of the patient (malignant or benign) [23]. The outcome of the ML model is the patterns that are extracted from the datasets. These are usually the classification results of a patient into cancerous and non-cancerous.
Artificial Intelligence for Cancer Detection
Data cleaning is an essential part of building the model. This step is performed during data preprocessing, which covers imputation of missing values, data normalization, data balancing, and feature optimization. Data imbalance usually exists in medical data. The imbalance ratio can be calculated using Eq. (1).
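A minimal sketch of the imbalance ratio follows, assuming the commonly used definition: the number of majority-class samples divided by the number of minority-class samples. The exact form intended by the text may differ, and the function name `imbalance_ratio` and the toy labels are illustrative only.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels: 90 benign and 10 malignant samples.
labels = ["benign"] * 90 + ["malignant"] * 10
print(imbalance_ratio(labels))  # 9.0
```

Under this definition a ratio of 1 indicates a perfectly balanced dataset, and larger values indicate stronger imbalance.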
Preprocessing techniques like dimensionality reduction [17] eliminate extraneous features and diminish noise. Data quality issues, such as noise [11, 20], outliers, missing or redundant data [21], and partial data, should be addressed before analysis, as they degrade the quality of the work. A machine learning prediction model can highlight the patterns that affect the prediction results [22,23,24]. Data are divided into train and test sets. Figure 4 demonstrates the steps that are necessary for building the computational model. The outcome of an AI-based predictive model is the patterns that are extracted from the datasets. These are usually the classification results of a patient into cancerous and non-cancerous.
Classification models are trained on the training set, and their performance is then evaluated on the test set using different performance evaluation parameters. To evaluate the performance of the classification models, confusion matrices are used to calculate different measures like accuracy, precision, recall, Area Under the Curve, and Matthews Correlation Coefficient [25]. The choice of evaluation parameter also depends heavily on the features of the dataset [26].
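The confusion-matrix measures mentioned above can be sketched as follows, using their standard definitions for a binary problem. The function names and the toy predictions are illustrative assumptions. Area Under the Curve is omitted because it requires predicted scores rather than hard class labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and Matthews correlation coefficient."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical toy data: 1 = cancerous, 0 = non-cancerous.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

On imbalanced medical data, accuracy alone can look high even when the minority class is poorly detected, which is one reason to report recall and the Matthews correlation coefficient alongside it.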
Applications of AI in Current Studies
AI-based ML and DL methodologies have made a significant contribution to cancer research [11]. The extent to which these methods are used is depicted in Fig. 5.
During the literature survey, we found that neural techniques inspired by the mechanism of the human brain are the most popular. Techniques like Probabilistic Neural Networks (PNN) [15], stacked sparse Auto-encoder (SSAE) [17], and Artificial Neural Networks (ANN) [18] are the most commonly used in cancer prediction studies. The activation functions commonly used in neural networks, like sigmoid, tanh, and rectified linear unit (relu), are explained in Eqs. (2, 3, 4).
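The three functions named above can be sketched in their standard textbook forms: sigmoid(x) = 1 / (1 + e^(-x)), tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)), and relu(x) = max(0, x). The scalar Python functions below are an illustrative assumption about presentation, not the notation of the referenced equations.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe for negative x: exp(x) lies in (0, 1)
    return e / (1.0 + e)


def tanh(x):
    """Hyperbolic tangent; output lies in [-1, 1]."""
    return math.tanh(x)


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value))
```

Sigmoid and tanh saturate for inputs of large magnitude, whereas relu grows linearly for positive inputs.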
SSAE works on the principle of computing the squared error (\(\boldsymbol{E}\)). This is mathematically explained in Eq. (5).
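As a rough illustration of the squared error, the sketch below assumes that E is the sum of squared differences between an input vector and its reconstruction by the autoencoder. The exact form of the referenced equation may differ, for example by a scaling factor or by added sparsity and weight-decay terms, which are omitted here. The names `squared_error`, `x`, and `x_hat` are hypothetical.

```python
def squared_error(x, x_hat):
    """Sum of squared differences between an input and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Hypothetical input vector and the reconstruction produced by a decoder.
x = [1.0, 0.0, 2.0]
x_hat = [0.5, 0.0, 1.0]
print(squared_error(x, x_hat))  # 1.25
```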
An important measure used in ANN to measure the prediction performance is logloss (given in Eq. (6)).
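A minimal sketch of log loss for binary classification follows, assuming the standard binary cross-entropy definition: the negative mean of y·log(p) + (1 − y)·log(1 − p) over all samples. The clipping constant `eps` and the function name are illustrative assumptions.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over samples.

    y_true holds labels in {0, 1}; y_prob holds predicted probabilities of
    class 1. Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical predictions: confident and correct versus uninformative.
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))
print(log_loss([1, 0], [0.5, 0.5]))
```

Lower values indicate better-calibrated predictions. A model that always outputs 0.5 scores ln 2, about 0.693, regardless of the labels.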
Neural Networks have been explored extensively in the literature, which is also evident from the research analysis tables presented in the section below that analyze the usage of these techniques in the current studies and highlight the significance of the neural techniques in cancer prediction studies. Figure 6 shows the structure of a neural network.
Research Analysis
Traditional supervised learning approaches are often employed for gene-expression data-based outcome prediction, in which only labeled data (i.e., data from samples with clinical follow-up) may be utilized for learning. In contrast, unlabeled data (i.e., data from samples without clinical follow-up) are ignored. Recent machine learning research suggests that using unlabeled data in conjunction with a small quantity of labeled data might result in a significant gain in learning accuracy, an approach known as semi-supervised learning. This section presents an analysis of research studies in the field of cancer detection. Table 3 presents an analysis of a few of the studies that use deep learning/neural networks and achieved praiseworthy prediction results.
Discussion
In the current review, recent AI-based studies on the diagnosis and prognosis of cancer that offer high accuracy are reviewed. Prediction modeling in cancer research generally relies on traditional supervised learning techniques, which take only labeled data into consideration for learning and disregard unlabeled data. Labeled data refer to data from samples with clinical follow-up, and unlabeled data are from samples without clinical follow-up. The most common limitation observed in our study is the small number of data instances. Apart from the size of the data, the quality of the dataset, together with careful dimension reduction techniques [37, 38] and data balancing approaches [39, 40], plays a significant role in effective cancer prediction results.
The study [4] proposed a model that has been tested on only a single lung cancer dataset; the model must be generalized to other cancers/datasets as well. The study [10] performed diagnosis of malignant mesothelioma, but the authors have not validated the dataset, as one attribute (identical to the target) is used for training the models. In study [18], the breast cancer recurrence prediction accuracy achieved by the proposed model is low. The study [9] used decision trees for predicting diabetes, and the accuracy attained by the model is insignificant. The study [15], which proposed a model based on a kidney disease dataset, has not dealt with the imbalanced nature of the dataset. The study [23] has not addressed the issue of class imbalance on the mesothelioma dataset. The study [27] has not highlighted the features that are the more significant cervical cancer risk factors. Further, two studies [28, 29] have proposed automated learning models for prediction of malignant mesothelioma; study [28] can be improved by incorporating better feature sets, and study [29] has not dealt with data imbalance.
The dataset used in the studies [30, 33] consists of only 100 samples, which is too few for validating the models on a prostate cancer dataset. The recent study [31] proposed a novel model and predicted breast, mesothelioma, and cervical cancer with appreciable accuracy, but the study has not applied any feature selection technique to the cancer datasets. The model proposed in the study [32] has not been tuned or tested on other datasets. The study [34] has assessed only an under-sampling technique, whereas hybrid balancing techniques could have performed better on the imbalanced datasets. The study [35] has not performed feature selection for cancer diagnosis. Another recent study proposed a novel ensemble model for cancer prediction, but the proposed model needs to be validated on more cancer datasets before it can be generalized.
Conclusion
The present study discusses AI-based approaches, outlining their significance in cancer prediction/prognosis. The recent studies reviewed emphasize the advancement of AI-based predictive models aimed at predicting valid diagnosis results. It is concluded that creating more publicly available heterogeneous databases would facilitate improvement in cancer prediction studies, and such practices can deliver more promising tools for interpretation in the cancer domain [41]. Regarding future directions, more efficient preprocessing and other learning approaches need to be developed. We also aim to explore the significance of blockchain in healthcare, especially in cancer research [42].
References
Scheuner G, Mitzscherling CP, Pfister C, Pöge A, Seidler E. Functional morphology of the human placenta. Zentralbl Allg Pathol. 1989;135(4):307–28.
Lee KA, Chae J-l, Shim JH. Natural diterpenes from coffee, cafestol and kahweol induce apoptosis through regulation of specificity protein 1 expression in human malignant pleural mesothelioma. J Biomed Sci. 2012;19(1):1–10.
Levine AB, Schlosser C, Grewal J, Coope R, Jones SJM, Yip S. Rise of the machines: advances in deep learning for cancer diagnosis. Trends Cancer. 2019;5:157–69.
Chen Y, Ke W, Chiu H. Risk classification of cancer survival using ANN with gene expression data from multiple laboratories. Comput Biol Med. 2014;48:1–7.
Li M, Zhou ZH. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst Man Cybern Part A. 2007;37(6):1088–98.
Ferlay J, Colombet M, Soerjomataram I, Mathers C, Parkin M, Piñeros M, Znaor A, Bray F. Estimating the global cancer incidence and mortality in 2018: GLOBOCAN sources and methods. Int J Cancer. 2019;144(8):1941–53.
Report of National Cancer Registry Programme (ICMR-NCDIR), Bengaluru, India 2020. https://ncdirindia.org/All_Reports/PBCR_Annexures/Default.aspx.
Islami F, et al. Proportion and number of cancer cases and deaths attributable to potentially modifiable risk factors in the United States. CA Cancer J Clin. 2018;68(1):31–54.
Habibi S, Ahmadi M, Alizadeh S. Type 2 diabetes mellitus screening and risk factors using decision tree: results of data mining. Global J Health Sci. 2015;7(5):304–10.
Er O, Tanrikulu AC, Abakay A, Temurtas F. An approach based on probabilistic neural network for diagnosis of Mesothelioma’s disease. Comput Electr Eng. 2012;38(1):75–81.
Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8–17.
Quinlan JR. Simplifying decision trees. Int J Hum Comput Stud. 1999;51(2):497–510.
Stigler SM. Thomas Bayes’s bayesian inference. J R Stat Soc Series A (Gen). 1982;145(2):250–8.
Cangelosi D, et al. Artificial neural network classifier predicts neuroblastoma patients’ outcome. BMC Bioinf. 2016. https://doi.org/10.1186/s12859-016-1194-3.
Potharaju SP, Sreedevi M. Ensembled rule based classification algorithms for predicting imbalanced kidney disease data. J Eng Sci Technol Rev. 2016;9(5):201–7.
Douzas G, Bacao F, Last F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci (Ny). 2018;465:1–20.
Groth D, Hartmann S, Klie S, Selbig J. Principal components analysis. Methods Mol Biol. 2013;930:527–47.
Shi M, Zhang B. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics. 2011;27(21):3017–23.
Lee MH, Liu Y. Kernel continuum regression. Comput Stat Data Anal. 2013;68:190–201.
Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
Chicco D. Ten quick tips for machine learning in computational biology. BioData Min. 2017. https://doi.org/10.1186/s13040-017-0155-3.
Qi Y, Zhao Z, Zhang L, Liu H, Lei K. A classification diagnosis of cervical cancer medical data based on various artificial neural networks. Int Conf Netw Commun Comput Eng (NCCE). 2018;147:579–82.
Er O, Abakay A. Use of artificial intelligence techniques for diagnosis of malignant pleural mesothelioma. Dicle Med J. 2015;42(1):5–11.
Saarela M, Ryynänen O, Äyrämö S. Artificial intelligence in medicine predicting hospital associated disability from imbalanced data using supervised learning. Artif Intell Med. 2019;95:88–95.
Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30(7):1145–59.
Batista GEAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl. 2004;6(1):20.
Gupta S, Gupta MK. Prostate cancer prognosis using multi-layer perceptron and class balancing techniques. In: 2021 thirteenth international conference on contemporary computing (IC3-2021), NY, USA. 2021. https://doi.org/10.1145/3474124.3474125.
Hu XUE, Yu Z. Diagnosis of mesothelioma with deep learning. Oncol Lett. 2019;17(2):1483–90.
Adem K, Kiliçarslan S, Cömert O. Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification. Expert Syst Appl. 2019;115:557–64.
Gupta S, Gupta MK. An approach based on neural learning for diagnosis of prostate cancer. JNR. 2020;21(3):110–8.
Gupta S, Gupta MK. A comprehensive data-level investigation of cancer diagnosis on imbalanced data. Comput Intell. 2021. https://doi.org/10.1111/coin.12452.
Gupta S, Gupta MK. Computational model for prediction of malignant mesothelioma diagnosis. Comput J. 2021. https://doi.org/10.1093/comjnl/bxab146.
Gupta S, Gupta MK. Deep learning for brain tumor segmentation using magnetic resonance images. IEEE Conf Comput Intell Bioinf Comput Biol (CIBCB). 2021. https://doi.org/10.1109/CIBCB49929.2021.9562890.
Chicco D, Rovelli C. Computational prediction of diagnosis and feature selection on mesothelioma patient health records. PLoS ONE. 2019;14(1):1–28.
Mathur R, Pathak V, Bandil D. Emerging trends in expert applications and security. 841st ed. Singapore: Springer; 2019.
Gupta S, Gupta MK. Computational prediction of cervical cancer diagnosis using ensemble-based classification algorithm. Comput J. 2021. https://doi.org/10.1093/comjnl/bxaa198.
Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinform. 2015. https://doi.org/10.1155/2015/198363.
Rekha G, Tyagi AK, Reddy VK. A wide scale classification of class imbalance problem and its solutions: a systematic literature review. J Comput Sci. 2019. https://doi.org/10.3844/jcssp.2019.886.929.
Sun Z, Song Q, Zhu X, Sun H, Xu B, Zhou Y. A novel ensemble method for classifying imbalanced data. Pattern Recognit. 2015;48(5):1623–37.
Fotouhi S, Asadi S, Kattan MW. A comprehensive data level analysis for cancer diagnosis on imbalanced data. J Biomed Inform. 2019;90:103089.
Kumar Y, Gupta S, Singla R, Hu Y. A systematic review of artificial intelligence techniques in cancer prediction and diagnosis. Arch Comput Methods Eng. 2021. https://doi.org/10.1007/978-3-030-31672-3_1.
Kumar Y, Sood K, Kaul S, Vasuja R. Big data analytics and its benefits in healthcare. In: Big data analytics in healthcare. Cham: Springer; 2020. p. 3–21.
Funding
This study has not received any funding.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Computational Statistics” guest edited by Anish Gupta, Mike Hinchey, Vincenzo Puri, Zeev Zalevsky and Wan Abdul Rahim.
Cite this article
Gupta, S., Kumar, Y. Cancer Prognosis Using Artificial Intelligence-Based Techniques. SN COMPUT. SCI. 3, 77 (2022). https://doi.org/10.1007/s42979-021-00964-3
DOI: https://doi.org/10.1007/s42979-021-00964-3