Introduction

Cancer is a group of diseases characterized by abnormal cell growth with the potential to invade or spread to other parts of the body. Such growths are malignant tumors, in contrast to benign tumors, which do not spread to other parts of the body [1, 2]. Hence, malignant tumors are harmful, while benign tumors are generally not categorized as harmful. The growing number of cancer cases points to the need for early analysis and prediction of cancer, which in turn requires research. The classification of patients into cancerous or healthy classes has attracted the interest of many bioinformatics investigators. A variety of techniques have been applied on a large scale in cancer studies to build predictive models that support efficient and precise decision-making. These models serve as effective decision support systems in the medical field. Artificial Intelligence (AI) is a broad term that encompasses Machine Learning (ML) and Deep Learning (DL). ML is a collection of methods within the broad class of predictive analytics [3]; these methods use numerical, probabilistic, and optimization techniques that let machines learn and detect patterns from massive, noisy, or complex datasets. The application of ML techniques in medicine, particularly when detailed proteomic and genomic information is available, should be encouraged for cancer prognosis [4, 5].

This study emphasizes the application of AI-based prediction models for early diagnosis, risk assessment, life expectancy estimation, and prediction of the probability of cancer recurrence. The motivation behind cancer studies is discussed in the next section, followed by the literature survey and discussion.

Motivation

Global cancer statistics motivate cancer research: the whole world is contending with the burden of cancer deaths. This section presents records of cancer deaths worldwide, followed by cancer death statistics for India.

(i) Cancer deaths in the world

As stated in a WHO report [6], around 9.6 million people died of cancer in 2018 alone, and many new cases are diagnosed regularly. Lung cancer is the most common cancer globally and is responsible for 22% of total cancer deaths in the world. Deaths caused by cancer are a significant worldwide concern that needs urgent attention. Approximately 19.3 million new cases and 10 million deaths were recorded in 2020 alone. Moreover, more than 1 million incident cases of stomach cancer occurred worldwide in 2020. Figure 1 shows the top five cancers that are responsible for the majority of cancer deaths.

(ii) Cancer deaths in India

Fig. 1 Most common types of cancer

According to the August 2019 report update by the National Registry Programme [7], cancer accounts for around 0.7 million deaths in India. The rise in cancer deaths of approximately 41% in just 8 years (from 2010 to 2018) underlines the need to find a solution to this problem as early as possible [8]. The statistics for 2018 are presented in Table 1.

Table 1 Cancer statistics

Males account for more than 52% of cancer deaths, while women account for about 47%. This indicates that more males than women have been affected by cancer. The top five cancers responsible for most cancer deaths in India are listed in Table 2.

Table 2 Cancers in India

In men, the leading cancers are oral cavity and lung cancers, which account for one-fourth of all cancer deaths, followed by stomach and other cancers. In women, breast and oral cavity cancers alone account for one-fourth of all cancer deaths. Cancer has become a global issue that must be resolved lest we lose generations of people to this deadly disease [6].

Various recent studies [5, 9, 10] have addressed cancer diagnosis, and several others [4, 11] have contributed to the analysis of survival time and risk assessment. Investigators must therefore find efficient techniques that can detect cancer at an early stage.

Significance and Contribution of Study

The prediction of prognosis at the moment of tumor identification is a crucial problem in clinical cancer research. Accurate outcome prediction can aid in the selection of the best treatment for each patient. For instance, if patients can be precisely assigned to subgroups based on whether the disease would relapse within a certain amount of time after tumor resection, adjuvant chemotherapy (CTX) could be given to patients at higher risk, while patients with stable conditions may be spared this toxic treatment. Microarray-based gene expression profiling has shown significant promise in predicting outcomes for many kinds of cancer.

This study contributes to cancer research by providing an overview of the current literature. It also highlights the limitations of previous studies, thereby paving the way for improved research. It further highlights the importance of AI techniques such as decision trees, support vector classifiers, and neural techniques, owing to their efficient performance in earlier studies.

Organization of Paper

The manuscript is organized for ease of reading. “Background Study” highlights the background that explains the importance of Artificial Intelligence-based techniques. “Artificial Intelligence for Cancer Detection” provides insights into the significant achievements of AI in cancer detection and presents prediction modeling within an AI framework. “Research Analysis” presents a research analysis of the current literature. “Discussion” discusses the findings of the study. Lastly, the article is concluded in “Conclusion”.

Background Study

This section gives an overview of ML techniques and analyzes how ML approaches are applied in the latest research studies.

AI-based techniques have often been used to diagnose and detect cancer [7,8,9]. AI-based models are used to determine whether a person showing the symptoms of a specific cancer is actually suffering from it.

AI-Based Learning Techniques

Machine learning techniques are classified into three broad categories. Figure 2 shows the categorization of machine learning approaches.

(i) Supervised learning: This type of learning involves a known set of input data with known responses; the model is trained to produce reasonable predictions in response to new data. Supervised learning approaches include decision tree techniques (CART) [12], Bayesian methods (Naïve Bayes [13] and variations thereof, Bayesian Model Averaging), Artificial Neural Networks [14], Instance-Based Learning (K Nearest Neighbors), and Ensemble Methods [15] (Boosting, Bagging, AdaBoost, Gradient Boosting Machines, Gradient Boosted Regression Trees, and Random Forest).

(ii) Unsupervised learning: This class of learning generates a descriptive model. In this class, there is no target to learn. The prime aim of this approach is to explore the data and find some patterns that can help analyze the data instances. Unsupervised learning includes Clustering Methods [16] (K Means and Hierarchical Clustering) and Principal Components [17].

(iii) Semi-supervised learning: Current studies suggest that combining unlabeled data with partially labeled data improves prediction results, i.e., accuracy [18]. Learning that considers both labeled and unlabeled data is termed semi-supervised learning. It is employed in the same areas as supervised learning but is particularly convenient when the cost of labeling is too high to permit a fully labeled training procedure. Semi-supervised learning methods include regression algorithms, such as Ordinary Least Squares Regression, Linear Regression, Logistic Regression [19], and Stepwise Regression. The classification of machine learning into three sub-types is shown in Fig. 2.

Fig. 2 Classification of machine learning techniques
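To make the idea of combining labeled and unlabeled data concrete, the following is a minimal sketch of self-training, one common semi-supervised strategy. Self-training is not named in this review and is used here only as an illustration; the nearest-centroid classifier, the one-dimensional features, and the names `fit_centroids`, `predict_with_margin`, `self_train`, and `min_margin` are all hypothetical choices made for brevity.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Return {label: mean feature value} for 1-D samples."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}


def predict_with_margin(centroids, x):
    """Return (nearest label, margin over the second-nearest centroid).

    Assumes at least two classes are present.
    """
    ranked = sorted((abs(x - c), label) for label, c in centroids.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, max_rounds=10):
    """Grow the labeled set with confidently pseudo-labeled samples.

    A sample is accepted when its margin is at least min_margin (an
    assumed confidence rule). Returns the final centroids and the
    unlabeled samples that were never accepted.
    """
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        remaining = []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:
                xs.append(x)
                ys.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):  # nothing accepted this round
            break
        pool = remaining
    return fit_centroids(xs, ys), pool


# Hypothetical data: two labeled samples and five unlabeled ones.
centroids, leftover = self_train([1.0, 9.0], [0, 1], [1.5, 2.0, 8.0, 8.5, 5.0])
```

In this toy run the ambiguous sample at 5.0 is equidistant from both centroids and stays unlabeled, which shows why a confidence rule matters: wrongly pseudo-labeled samples would otherwise be fed back into training.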

Data Gathering

The data gathering step is the most crucial, as it determines the quality and reliability of the classification model. Figure 3 shows the variety of data used in cancer diagnosis.

Fig. 3 Type of data used in cancer studies

The data used for diagnosing cancer are often gathered from information available within the hospital. Cancers are often diagnosed using image processing tools that extract data from medical images such as magnetic resonance imaging (MRI) and computed tomography (CT) scans [3], together with gene expression (mutations in genes) or microarray analysis [20], clinical data, demographic features [21], expert notes, and other electronic health records [22]. The information collected is often used as the decisive parameter for modeling the classifier that determines the patient's outcome (malignant or benign) [23]. The outcome of the ML model is the set of patterns extracted from the datasets, usually the classification of a patient as cancerous or non-cancerous.

Artificial Intelligence for Cancer Detection

Data cleaning is an essential part of building the model and is performed during data preprocessing. Data preprocessing covers imputation of missing values, data normalization, data balancing, and feature optimization. Data imbalance usually exists in medical data. The imbalance ratio can be calculated using Eq. (1).

$$I = \frac{{M_{i} }}{{M_{a} }}$$
(1)
where \(M_{i}\) is the size of the minority class and \(M_{a}\) is the size of the majority class.
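As a small worked example of Eq. (1), the sketch below computes the imbalance ratio from a list of class labels. The function name `imbalance_ratio` and the sample labels are hypothetical; for more than two classes it assumes the ratio is taken between the smallest and the largest class.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return I = M_i / M_a: minority class size over majority class size."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    sizes = counts.values()
    return min(sizes) / max(sizes)


# Hypothetical dataset: 20 malignant and 80 benign samples.
labels = ["malignant"] * 20 + ["benign"] * 80
ratio = imbalance_ratio(labels)  # 20 / 80
```

A ratio of 1 indicates perfectly balanced classes, and values close to 0 indicate severe imbalance.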

Preprocessing techniques such as dimensionality reduction [17] remove extraneous features and diminish noise. Data quality issues such as noise [11, 20], outliers, missing or redundant data [21], and partial data should be addressed before analysis, as they degrade the quality of the results. A machine learning prediction model can highlight the patterns that affect the prediction results [22,23,24]. The data are divided into training and test sets. Figure 4 shows the steps needed to build the computational model. The outcome of an AI-based predictive model is the set of patterns extracted from the datasets, usually the classification of a patient as cancerous or non-cancerous.
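The preprocessing and splitting steps described above can be sketched as follows. The specific choices here (mean imputation, min-max normalization, and a stratified split) are assumptions made for illustration, since the text does not prescribe particular methods, and the function names are hypothetical.

```python
import random


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one sample of every class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical feature column with one missing value.
feature = min_max_scale(impute_mean([2.0, None, 6.0]))
```

Splitting per class rather than over the whole dataset is one way to keep a rare cancerous class represented in both sets when the data are imbalanced.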

Fig. 4 Flowchart of the methodology used for cancer detection

Classification models are trained on the training set, and their performance is then evaluated on the test set using different performance evaluation parameters. To evaluate the classification models, confusion matrices are used to calculate measures such as accuracy, precision, recall, Area Under the Curve, and Matthews Correlation Coefficient [25]. The choice of evaluation parameter also depends heavily on the features of the dataset [26].
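The sketch below shows how several of these measures follow from the four cells of a binary confusion matrix. It uses the standard textbook definitions; the function names and the example labels are hypothetical. Area Under the Curve is left out because it requires predicted scores rather than hard class labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and MCC from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}


# Hypothetical test-set labels (1 = cancerous, 0 = non-cancerous).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```

On imbalanced data, accuracy can look high even when the minority class is poorly detected, which is one reason to report recall and the Matthews Correlation Coefficient alongside it.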

Applications of AI in Current Studies

AI-based ML and DL methodologies have made a significant contribution to cancer research [11]. The extent to which these methods are used is depicted in Fig. 5.

Fig. 5 AI techniques used in cancer studies

During the literature survey, we found that neural techniques, inspired by the mechanism of the human brain, are the most popular. Techniques such as Probabilistic Neural Networks (PNN) [15], stacked sparse auto-encoders (SSAE) [17], and Artificial Neural Networks (ANN) [18] are the most commonly used in cancer prediction studies. The activation functions commonly used in neural networks, namely sigmoid, tanh, and rectified linear unit (ReLU), are given in Eqs. (2), (3), and (4).

$${\text{Sigmoid}}\left( z \right) = \frac{1}{{1 + e^{ - z} }}$$
(2)
$$\tanh \left( z \right) = \frac{2}{{1 + e^{ - 2z} }} - 1$$
(3)
$${\text{ReLU}}\left( z \right) = \max \left( {0,z} \right)$$
(4)
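Eqs. (2), (3), and (4) translate directly into code. The following sketch evaluates them for scalar inputs using only the standard library; it is meant for moderate values of `z`, since `math.exp` overflows for very large negative inputs.

```python
import math


def sigmoid(z):
    """Eq. (2): 1 / (1 + e^(-z)), with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Eq. (3): 2 / (1 + e^(-2z)) - 1, with outputs in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * z)) - 1.0


def relu(z):
    """Eq. (4): max(0, z)."""
    return max(0.0, z)


# Evaluate the three functions on a few sample inputs.
samples = [-2.0, 0.0, 2.0]
table = [(z, sigmoid(z), tanh(z), relu(z)) for z in samples]
```

Note that Eq. (3) is a rescaled sigmoid: tanh(z) = 2 · Sigmoid(2z) − 1.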

An SSAE is trained by minimizing an error function (\(E\)) that combines the mean squared error with regularization terms. This is expressed mathematically in Eq. (5).

$$E = {\text{MSE}} + \left( {\gamma \times L2{\text{Regularization}}\,{\text{Term}}} \right) + \left( {\delta \times {\text{Sparsity}}\,{\text{Regularization}}\,{\text{Term}}} \right)$$
(5)
where \(\gamma\) is the coefficient for the \(L2\) regularization term, \(\delta\) is the coefficient for the sparsity regularization term, and MSE is the mean squared error.
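A minimal sketch of Eq. (5) follows. The equation does not define the two regularization terms, so this example assumes forms that are common for sparse autoencoders: half the sum of squared weights for the L2 term, and a Kullback-Leibler divergence between a target sparsity `rho` and the mean activation of each hidden unit for the sparsity term. These forms, the default `rho`, and all function names are assumptions rather than the definitions used in the cited study.

```python
import math


def mse(inputs, reconstructions):
    """Mean squared reconstruction error over all samples and features."""
    total, count = 0.0, 0
    for x, x_hat in zip(inputs, reconstructions):
        for a, b in zip(x, x_hat):
            total += (a - b) ** 2
            count += 1
    return total / count


def l2_term(weights):
    """Assumed L2 term: half the sum of squared weights."""
    return 0.5 * sum(w * w for row in weights for w in row)


def sparsity_term(hidden_activations, rho=0.05):
    """Assumed sparsity term: KL divergence between rho and the mean
    activation of each hidden unit (activations must lie in (0, 1))."""
    n_samples = len(hidden_activations)
    n_units = len(hidden_activations[0])
    total = 0.0
    for j in range(n_units):
        rho_hat = sum(sample[j] for sample in hidden_activations) / n_samples
        total += (rho * math.log(rho / rho_hat)
                  + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return total


def ssae_cost(inputs, reconstructions, weights, hidden_activations,
              gamma, delta, rho=0.05):
    """Eq. (5): E = MSE + gamma * L2 term + delta * sparsity term."""
    return (mse(inputs, reconstructions)
            + gamma * l2_term(weights)
            + delta * sparsity_term(hidden_activations, rho))
```

Raising `gamma` penalizes large weights more strongly, while raising `delta` pushes the hidden units toward the target activation level `rho`.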

An important measure of prediction performance in ANNs is the log loss (Logloss), given in Eq. (6).

$${\text{Logloss}} = - \sum \left[ {S\log \left( {S^{i} } \right) + \left( {1 - S} \right)\log \left( {1 - S^{i} } \right)} \right]$$
(6)
where \(S\) is the vector of actual values and \(S^{i}\) is the vector of predicted values.
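The following sketch computes the log loss of Eq. (6) for binary labels. It assumes the conventional form, in which the sum is negated and averaged over the samples so that lower values indicate better predictions. The clipping constant `eps`, which avoids taking the logarithm of zero, and the function name are assumptions made for this illustration.

```python
import math


def log_loss(actual, predicted, eps=1e-15):
    """Mean binary cross-entropy between labels and predicted probabilities.

    actual: 0/1 labels (S in Eq. (6)).
    predicted: probabilities of the positive class (S^i in Eq. (6)).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = 0.0
    for s, p in zip(actual, predicted):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += s * math.log(p) + (1 - s) * math.log(1.0 - p)
    return -total / len(actual)


# Hypothetical predictions: a confident model versus an uncertain one.
confident = log_loss([1, 0], [0.9, 0.1])
uncertain = log_loss([1, 0], [0.6, 0.4])
```

Here the confident, correct predictions yield the smaller loss, and a model that always outputs 0.5 scores log 2 regardless of the labels.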

Neural networks have been explored extensively in the literature. This is evident from the research analysis tables presented in the following section, which examine the use of these techniques in current studies and highlight their significance in cancer prediction. Figure 6 shows the structure of a neural network.

Fig. 6 Structure of neural network

Research Analysis

Traditional supervised learning approaches are often employed for gene-expression data-based outcome prediction, in which only labeled data (i.e., data from samples with clinical follow-up) may be utilized for learning. In contrast, unlabeled data (i.e., data from samples without clinical follow-up) are ignored. Recent machine learning research suggests that using unlabeled data in conjunction with a small quantity of labeled data might result in a significant gain in learning accuracy, an approach known as semi-supervised learning. This section presents an analysis of research studies in the field of cancer detection. Table 3 presents an analysis of several studies that use deep learning or neural networks and achieved noteworthy prediction results.

Table 3 Analysis of studies using neural techniques

Discussion

This review covers recent AI-based studies on the diagnosis and prognosis of cancer that report high accuracy. Prediction modeling in cancer research generally relies on traditional supervised learning techniques, which take only labeled data into consideration for learning and disregard unlabeled data. Labeled data refer to data from samples with clinical follow-up, and unlabeled data are from samples without clinical follow-up. The most common limitation observed in our study is the small number of data instances. Apart from the size of the data, the quality of the dataset, together with careful dimension reduction techniques [37, 38] and data balancing approaches [39, 40], plays a significant role in effective cancer prediction.

The model proposed in study [4] has been tested on only a single lung cancer dataset and should be generalized to other cancers and datasets as well. Study [10] performed diagnosis of malignant mesothelioma, but the authors did not validate the dataset, as one attribute (identical to the target) was used for training the models. In study [18], the breast cancer recurrence prediction accuracy achieved by the proposed model is low. Study [9] used decision trees for predicting diabetes, and the accuracy attained by the model is low. Study [15] proposed a model based on a kidney disease dataset but did not deal with the imbalanced nature of the dataset. Study [23] did not address the issue of class imbalance in the mesothelioma dataset. Study [27] did not highlight the features that are the more significant cervical cancer risk factors. Further, two studies [28, 29] proposed automated learning models for the prediction of malignant mesothelioma; study [28] can be improved by incorporating better feature sets, and study [29] did not deal with data imbalance.

The dataset used in studies [30, 33] contains only 100 samples, which is too few for validating models on a prostate cancer dataset. The recent study [31] proposed a novel model and predicted breast cancer, mesothelioma, and cervical cancer with appreciable accuracy, but it did not apply any feature selection technique to the cancer datasets. The model proposed in study [32] has not been tuned or tested on other datasets. Study [34] assessed only an under-sampling technique, whereas hybrid balancing techniques could have performed better on the imbalanced datasets. Study [35] did not perform feature selection for cancer diagnosis. Another recent study proposed a novel ensemble model for cancer prediction, but the model should be validated on more cancer datasets so that it can be generalized.

Conclusion

The present study discusses AI-based approaches and outlines their significance in cancer prediction and prognosis. The recent studies reviewed emphasize the advancement of AI-based predictive models aimed at producing valid diagnosis results. It is concluded that creating more publicly available heterogeneous databases would facilitate improvement in cancer prediction studies, and such practices can deliver more promising tools for interpretation in the cancer domain [41]. Regarding future directions, more efficient preprocessing and other learning approaches need to be developed. We also aim to explore the significance of blockchain in healthcare, especially in cancer research [42].