1 Introduction

Cancer is a deadly disease that caused about 10 million deaths worldwide in 2020 and is expected to show the greatest relative increase by 2040 [1]. In the United States, one in ten adults has been diagnosed with cancer [2]. The situation is no different in India, which ranks third in cancer cases worldwide [3]. The death rate due to cancer in India has been increasing rapidly since 1990 [4]. These figures are alarming. Cancer usually arises from the abnormal growth of cells [5] and metastasizes to various parts of the body through blood or lymph vessels [6]. Cancer cells take over the space and nutrients needed by the healthy organ, which may lead to organ failure and death. Over the past decades, cancer research has been directed toward detecting cancer at its initial stages. Researchers' continuous efforts have led to the development of new techniques and strategies for the early prediction of cancer, which helps in its treatment [7]; one example is Otsu's thresholding technique for the segmentation of brain tumors in MRI images [8]. The treatment of cancer patients can be further improved by survival time analysis [9]. Survival time refers to the duration between the date of diagnosis or initiation of treatment for a disease, such as cancer, and the conclusion of the observation period. Predicting cancer survivability is challenging because cancer is a complex disease influenced by various genetic and environmental factors. Additionally, incomplete or noisy data, limited sample sizes, and variability in data collection protocols pose obstacles to accurate prediction. Survival prediction is nevertheless crucial for therapeutics and clinical management because timely and accurate predictions can guide treatment decisions and help improve patient outcomes.

A common approach for assessing the effectiveness of a new treatment in a clinical trial is to measure overall survival. Accurate prediction of survival time can give doctors a better basis for treating a patient. In past decades, high-throughput technologies have been used to predict survival time, which helps to define prognostic indices for mortality or recurrence of disease and to thoroughly investigate the outcome of treatment. Survival time is computed using clinical, image, or genomic data. Among these data types, genomic data is the most valuable for accurate prediction because it provides information about the molecular mechanisms of cancer development and progression, whereas clinical and imaging data do not capture genetic changes in cancer patients. Including genomic data therefore helps predict survival time accurately by exploring the complexities of the disease and also enables the identification of biomarkers of survival. Genomic data can be obtained from various open-access platforms such as Gene Expression Omnibus (GEO) [10] and The Cancer Genome Atlas (TCGA) [11, 12]. AI and machine learning are useful in the medical and healthcare fields, including disease detection, healthcare services, and industry applications. This review article concentrates on the application of both conventional machine learning and deep learning methods that utilize genomic data for predicting survival time. As genomic data becomes increasingly standardized and analysis techniques continue to evolve, it has the potential to significantly enhance the development of robust algorithms for predicting survival time.

The highly complex and expensive analysis of genomic data is a significant burden for clinicians in terms of diagnosis, prediction, and subsequent management. Correspondingly, diagnosis and treatment planning remain stagnant and fallible, as they rely on the physician's skills and expertise, which may be intuitive and inaccurate. Hence, quantitative measures are required to support diagnosis. Advanced machine learning and deep learning techniques provide targeted solutions. By employing these learning techniques, clinicians can enhance treatment planning and achieve better patient outcomes. The use of machine learning and deep learning techniques with various types of genomic data for predicting survival time is shown in Fig. 1.

Fig. 1 Use of machine/deep learning techniques for survival prediction of cancer patients

The prediction of survival time for cancer patients involves a few sequential steps: (1) preprocessing of genomic data, (2) dimensionality reduction, (3) feature selection, (4) model training, and (5) prediction of survival time. In the training stage, various types of genomic data (DNA methylation [13], copy number alteration, mRNA, and other data types) are preprocessed. The resulting features are then used, either individually or in combination, for dimensionality reduction, and the model is trained using various machine learning techniques. Figure 2 illustrates these steps of survival prediction using machine learning techniques.

Fig. 2 End-to-end framework for prediction of survival time using machine learning techniques
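As a rough illustration of the five steps listed above, the following Python sketch runs them on synthetic data. It is a minimal sketch under stated assumptions and not the method of any reviewed study: the function names `fit_pipeline` and `predict`, the synthetic latent-factor data, PCA for step 2, correlation ranking for step 3, and ordinary least squares for step 4 are all choices made here for illustration, and censoring is ignored for brevity.

```python
import numpy as np

def fit_pipeline(X, y, n_components=5, n_selected=3):
    """Fit the illustrative pipeline on X (samples x features) and survival times y."""
    # (1) Preprocessing: z-score every feature using training statistics.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Z = (X - mean) / std
    # (2) Dimensionality reduction: PCA through the singular value decomposition.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    scores = Z @ components.T
    # (3) Feature selection: keep the components most correlated with y.
    yc = y - y.mean()
    denom = np.linalg.norm(scores, axis=0) * np.linalg.norm(yc) + 1e-12
    corr = np.abs(scores.T @ yc) / denom
    keep = np.argsort(corr)[::-1][:n_selected]
    # (4) Model training: ordinary least squares with an intercept.
    design = np.column_stack([np.ones(len(y)), scores[:, keep]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"mean": mean, "std": std, "components": components[keep], "coef": coef}

def predict(model, X):
    """(5) Prediction: apply the stored transformations, then the linear model."""
    Z = (X - model["mean"]) / model["std"]
    scores = Z @ model["components"].T
    return model["coef"][0] + scores @ model["coef"][1:]

# Synthetic stand-in for expression data: 60 patients, 200 features, 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 2))
loadings = rng.normal(size=(2, 200))
X = latent @ loadings + 0.1 * rng.normal(size=(60, 200))
y = 36.0 + 10.0 * latent[:, 0] + rng.normal(size=60)  # hypothetical survival in months

model = fit_pipeline(X[:40], y[:40])      # train on the first 40 patients
predicted_months = predict(model, X[40:])  # predict for the held-out 20
```

A real analysis would replace the least squares step with a model that handles censored observations, such as a Cox regression.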

On the other hand, advanced deep learning methods that utilize layered artificial neural networks (ANN) with supervised or unsupervised learning techniques automatically combine feature selection, dimensionality reduction, and prediction into a single process. As deep learning models strive to discover concealed patterns and connections, they usually perform better in predicting survival time than traditional machine learning approaches. With the increasing availability of genomic data and advanced processing capabilities, deep learning is gaining popularity as a powerful tool for genomic data analysis.

Existing literature on predicting survival time in cancer patients often relies on clinical or image data. However, these methods may not always provide accurate predictions, and they do not fully utilize the wealth of information available in genomic data. Few reviews address survival prediction of cancer patients using genomic data; those that exist are not extensive and do not compare different machine learning models in a way that indicates directions for future research. This paper aims to comprehensively review the published literature regarding the use of machine learning and deep learning techniques for survival prediction of breast, glioblastoma, lung, renal cell, and oral cancer. The primary outcome indicators are the biomarkers of survival for these cancer types and the accuracy of survival prediction using genomic data.

The main contributions of this paper are as follows:

  • Comprehensive review of the use of machine learning and deep learning techniques for survival prediction in various types of cancer.

  • Analysis of the effectiveness of these techniques in utilizing various types of genomic data for predicting the survival time.

  • Analysis of various feature selection methods for extracting the important features for survival prediction.

  • Identification of potential areas for future research and improvement in this field.

The rest of this review is organized as follows. Section 2 provides a detailed overview of the machine-learning approaches for cancer survival prediction. Section 3 presents a discussion of these techniques. Finally, Section 4 concludes the review and provides directions for future research.

2 Machine learning approaches for cancer survival prediction

2.1 Public databases for genomic dataset

Datasets are widely accessible and are selected at the discretion of individual research groups. However, most groups build algorithms on well-established cancer patient databases to enhance the research value. The best-known databases for genomic data are The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). TCGA is the largest database, with freely available data for more than 33 types of cancer (https://portal.gdc.cancer.gov/). ICGC (https://dcc.icgc.org/) gives cancer researchers access to over 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors.

2.2 Performance measurements

In cancer research, different research groups use different algorithms and different types of datasets. To compare them, researchers rely on a range of metrics. Commonly used metrics are the concordance index (c-index), accuracy, sensitivity, specificity, mean squared error or mean absolute deviation, precision and recall, and overall survival time.
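Because the c-index is specific to survival analysis, a small worked example may help. The sketch below computes Harrell's concordance index, the fraction of comparable patient pairs in which the patient who died earlier was assigned the higher predicted risk. The function name `concordance_index` and the toy values are illustrative assumptions; pairs with tied survival times are simply skipped, which is one of several conventions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    times       -- observed follow-up times
    events      -- 1 if death was observed, 0 if the observation is censored
    risk_scores -- model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:  # patient i died before patient j was last seen
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at 15 months.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 1 indicates perfect ranking, 0.5 corresponds to random ordering, and values below 0.5 indicate systematically inverted predictions.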

2.3 Selection procedure

The literature search was performed on the IEEE Xplore digital library, Web of Science, Science Direct, Google Scholar, and PubMed using the keywords “Genomic data” and “Survival” with “Breast cancer”, “Glioblastoma”, “Lung cancer”, “Renal cell cancer”, and “Oral cancer”. These keywords were then combined using the “AND” operator with “Machine Learning” or “Deep Learning” and “Survival prediction”. Journal articles and conference papers were included. The inclusion criterion was the use of machine learning or deep learning models for the prediction of survival using genomic data. The most recent papers were included in the study.

2.4 Machine learning approaches for survival prediction of cancer

Clinically, the ability to predict the survival of cancer patients has a great impact, as it helps in planning treatment [14]. This could be advantageous not only for the patients themselves but also for their families, who often undergo significant stress when caring for cancer patients. In the long term, it could also result in cost savings for treatment. Therefore, much cancer research focuses on predicting the survival time of patients using imaging, clinical, and genomic datasets [15]. As genomic data increases the accuracy of survival time prediction, it is worth including in such studies. However, the inclusion of genomic data makes high dimensionality a major problem, so several researchers use dimensionality reduction techniques to address it. A survival prediction model has two main components: dimensionality reduction or feature selection, and the prediction model. Dimensionality reduction and feature selection techniques are mainly of two types, supervised and unsupervised. Popular techniques include Principal Component Analysis (PCA) [16], Non-negative Matrix Factorization (NMF) [17], and Factor Analysis [16]. With these methods, the number of features can be reduced.
Additionally, recent advancements in digital healthcare have focused on utilizing fog and cloud networks for cancer detection. These include novel paradigms such as the Multi-Cancer Multi-Omics Clinical Dataset Laboratories (MCMOCL) schemes, which incorporate federated learning, auto-encoder, and XGBoost methods to improve accuracy, reduce processing delay, and enhance security in heterogeneous cancer clinics [18]. Other studies have explored hybrid cancer detection schemes utilizing SARSA reinforcement learning and multi-omics data processing in fog cloud networks, aiming to enhance accuracy and reduce processing time in distributed clinical settings [19].

Machine learning techniques help in extracting meaningful patterns from complex genomic data, which is not feasible with traditional analytical methods [20]. By incorporating machine learning techniques with genomic data, the survival time can be predicted accurately. Several machine learning techniques can be used for the prediction; commonly used ones are Support Vector Machine (SVM), AdaBoost, Decision trees, and Random forest. While developing machine learning techniques for survival time prediction using genomic data, researchers should consider various factors such as feature selection, model interpretability, and validation methodologies. Feature selection techniques aim to identify the most informative genomic features while reducing noise and overfitting. Model interpretability ensures that predictions are clinically actionable. Validation methodologies such as cross-validation and external validation assess model generalizability and robustness across diverse datasets.
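To make the cross-validation step concrete, the sketch below shows one way to split patient indices into k folds so that every patient is used for testing exactly once. The function name `k_fold_splits`, the fold count, and the cohort size are illustrative assumptions; stratifying the folds by event status, which is often advisable for imbalanced survival data, is left out for brevity.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits

# Hypothetical cohort of 23 patients split into 5 folds.
splits = k_fold_splits(n_samples=23, k=5)
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged. External validation differs in that the test set comes from an independent cohort rather than from a held-out fold.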

There are various challenges in applying machine learning to genomic data analysis for cancer survival prediction including data heterogeneity, incomplete data, feature biases, model overfitting, and interpretability issues. Limited sample sizes and imbalanced datasets can lead to biased model performance and poor generalizability. Furthermore, the complexity of genomic data necessitates sophisticated feature engineering and regularization techniques to prevent overfitting and enhance model interpretability. Addressing these challenges is critical to ensuring the reliability and clinical utility of predictive models.

Recent research on survival prediction using genomic data has focused on various cancer types such as breast cancer, lung cancer, colorectal cancer, ovarian cancer, glioblastoma, oral cancer, renal cell cancer, and cervical cancer. This review focuses on breast cancer, lung cancer, renal cell cancer, oral cancer, and glioblastoma. These cancer types represent a diverse spectrum of diseases with distinct molecular characteristics and prognostic factors, making them suitable candidates for genomic data analysis to improve survival prediction accuracy.

Some researchers focused on finding biomarkers of cancer survival time using various machine learning algorithms. Various studies revealed that different long non-coding RNAs (lncRNAs) [15, 21,22,23], genomic instability-derived lncRNA [24], and autophagy-associated long non-coding RNAs (ARlncRNAs) [25, 26] act as biomarkers for the prediction of survival time. Apart from lncRNA, the Fanconi anemia pathway can act as a prognostic biomarker for survival prediction [27]. Studies also demonstrated that transfer learning-based deep features [28], radiomics signatures [29, 35], ten glucose metabolism risk signature [30], prognostic index, stem cell-related gene signature [31], seven CpG-based signature [32], 6-gene signature [33], aggregated signature based on ligand-gated channel pathways [34], TLS [36], APE1 polymorphism [37], TP53 [28], COL4A5, ABCB1, NR3C2, and PLG [26] can act as predictive biomarkers, while the 5-snoRNA signature [38], cancer-associated fibroblasts [39], and TP53 [40] are not promising predictors of survival. Research has also been performed to show the importance of tumor environment [41], age [42], and oral hygiene [43] in the prediction of survival time. Some authors identified various miRNA or mRNA genes [44, 45] and nomogram-based gene or miRNA signatures [45,46,47,48,49] that act as predictors of survival. Another study [50] implemented the ESTIMATE machine learning algorithm, which identified IL10, IGLL5, and POU2AF1 as prognostic biomarkers. Lung adenocarcinoma risk was studied using Random forest, Univariate Cox, and SigFeature algorithms, which identified the expression of 16 genes as highly correlated with patient risk [51]. Another study showed the significance of somatic mutation in survival prediction [52].
A study on the effect of race on survival concluded that African Americans had prolonged survival [53], and another study found that survival rates differ by gender [54]. Identifying biomarkers associated with survival time is a key focus of research, but challenges persist in determining the most predictive features among the vast number of genomic features.

Probing further, researchers compared different machine learning methods for survival prediction. For instance, survival time was estimated with 1-Nearest Neighbor (1NN), Naive Bayes (NB), SVM, AdaBoost, Trees Random Forest (TRF), Radial Basis Function Network (RBFN), and Multilayer Perceptron models; of these, TRF, a rule-based classification model, gave the best predictions with the highest precision [55]. Furthermore, a study that used six machine learning models (AdaBoost, NB, SVM, RF, Adabag, and Least-Squares SVM (LSSVM)) and two classical methods (Logistic Regression (LR) and Linear Discriminant Analysis (LDA)) to predict survival time and metastasis of breast cancer concluded that SVM outperformed the other models by providing more accurate results [56].

Some researchers proposed new models for survival prediction using genomic data. In one study, the Genomic data and Pathological images Multiple Kernel Learning (GPMKL) model, based on Multiple Kernel Learning (MKL), integrated pathological images and genomic data for survival time prediction. This model was created to perform feature fusion, which is a crucial aspect of breast cancer classification. The results indicate that integrating genomic data with pathological images produces better outcomes than using either alone: at 95% specificity, the sensitivity, accuracy, and precision of GPMKL increased by 4.3%, 0.9%, and 3.8%, respectively, compared with Genomic data based Multiple Kernel Learning (GMKL) and by 13.9%, 3.2%, and 16.4% compared with Pathological images based Multiple Kernel Learning (PMKL), which supports the usefulness of GPMKL in predicting human breast cancer survival [57]. In another study, immunohistochemistry was used to cluster the dataset by receptor status, significant variables were ranked by the random forest variable selection method, and a multiplatform network named Multimodal AutoEncoders (MAE) was implemented to classify breast cancer patients by survival rate and subtype. Survival rate prediction was performed using multiple modalities, and the lowest mean squared error was achieved with gene expression (0.16541). Moreover, decision tree (DT), NB, K-Nearest Neighbor (KNN), LR, SVM, RF, and gradient boosting trees (GBT) were implemented for survival prediction, with the GBT- and RF-based classifiers and regression models performing best [58]. A new algorithm, Crystall, was proposed for breast cancer; it predicted the survival time of patients and classified them by whether they would live longer than 5 years.
The proposed algorithm performed better on both problems and achieved a mean absolute error of 31.62 days for predicting how long a breast cancer patient will live within 5 years [59]. To improve disease-free survival prediction for lung squamous cell carcinoma, a novel method named LSCDFS-MKL, based on multiple kernel learning, was proposed. The model used the gradient descent algorithm for solving various kernel learning problems and integrated pathological images and genomic data. The method increased specificity, accuracy, and sensitivity by 2.20%, 2.68%, and 7.14% compared with using genomic data only and by 9.89%, 24.11%, and 34.02% compared with pathological images only. The reported accuracy of LSCDFS-MKL was 100% for the prediction of disease-free survival, and it performed better than other prediction methods [60].

A new Ordinal Multi-Modal Feature Selection (OMMFS) framework [61] was developed to identify features from pathological images, DNA methylation, mRNA, and copy number variation; it used a sparse canonical correlation analysis framework with ordinal survival information. The results showed that this method performs better in patient stratification and can serve as a general framework for any cancer type to identify biomarkers or predict treatment response. Another stratification method, based on the elastic net penalized Cox proportional hazards regression model, was designed to group advanced-stage oral cancer patients into different risk groups using genetic and clinicopathological features [62], which helped create an online calculator.
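For readers unfamiliar with the elastic net penalized Cox model mentioned above, the sketch below evaluates the quantity such a model minimizes: the negative Cox partial log-likelihood plus a mix of L1 and L2 penalties. It is a generic textbook formulation, not the implementation of the cited study; the function name `cox_elastic_net_loss`, the parameters `lam` and `alpha`, the averaging over events, and the toy data are assumptions, and tied event times are handled naively.

```python
import math

def cox_elastic_net_loss(beta, X, times, events, lam=0.1, alpha=0.5):
    """Negative Cox partial log-likelihood (averaged over events) plus elastic net penalty.

    beta   -- coefficient vector, one entry per feature
    X      -- list of feature rows, one per patient
    times  -- follow-up times; events -- 1 for observed death, 0 for censored
    lam    -- overall penalty strength; alpha -- L1 share (1 = lasso, 0 = ridge)
    """
    # Linear predictor (log relative hazard) for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    neg_log_lik = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue  # censored patients contribute only through the risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        neg_log_lik -= eta[i] - math.log(risk_sum)
    n_events = max(sum(1 for e in events if e), 1)
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return neg_log_lik / n_events + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)

# Toy data: one feature whose high values go with early death.
X_toy = [[1.0], [0.0], [-1.0]]
t_toy = [1.0, 2.0, 3.0]
e_toy = [1, 1, 1]
loss_right_sign = cox_elastic_net_loss([1.0], X_toy, t_toy, e_toy, lam=0.0)
loss_wrong_sign = cox_elastic_net_loss([-1.0], X_toy, t_toy, e_toy, lam=0.0)
loss_null = cox_elastic_net_loss([0.0], X_toy, t_toy, e_toy, lam=0.0)
```

The L1 part drives some coefficients to exactly zero, which performs feature selection, while the L2 part stabilizes the fit when genomic features are strongly correlated. Patients can then be grouped into risk strata by thresholding the fitted linear predictor.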

A novel integrative model based on the Bayesian averaging model for renal cell carcinoma was proposed; it used the dimensionality reduction techniques PCA and Sparse PCA (SPCA) to generate low-dimensional features from three genomic data types and considered the interaction between the data types. The mean squared error was calculated for both dimensionality reduction techniques, and the results were compared with and without the interaction between data types. The mean squared error was lowest for PCA with interaction (2.07). These models also validated ccRCC-based biomarkers for renal cell carcinoma that had been verified in the literature [63]. Another novel machine learning model [64] employed coherent voting networks and predicted the survival time of breast cancer accurately.

Other research [69] used the SVM model to investigate the relationship between glioma topographic location and molecular characteristics and suggested that tumor location plays a role in glioma development and could be used to improve treatment and predict outcomes. Another study [70] explored the role of anoikis resistance in breast cancer metastasis and treatment optimization. Through a comprehensive analysis of mRNA and lncRNA profiles, ten key mRNAs and six lncRNAs associated with anoikis were identified using the LASSO Cox regression model. A further study [71] integrated multi-omics data with clinical factors to identify significant biomarkers for Glioblastoma Multiforme (GBM) prognosis. Employing the Multimodal iterative Random Forest (MiRF) algorithm, it isolated 35 molecular features comprising 19 genes and 16 proteins, which distinguished between short-term and long-term survival as well as high and low Karnofsky performance scores. Another study [72] integrated histology and genomics using a probabilistic graphical model framework. It used a multilayer perceptron model together with canonical correlation analysis (CCA) and its penalized variants (pCCA) to generate informative embeddings capturing underlying cancer properties. Other research [73] utilized multi-omics data from lung adenocarcinoma patients to improve survival prediction accuracy. By using novel feature extraction techniques and unbiased selection methods, 32 molecular features were identified from the TCGA dataset, achieving an AUC of 0.839 for a 2-year survival prediction model.

Several machine learning algorithms have been implemented. However, the choice of algorithm often depends on specific cancer types and dataset characteristics.

Conventional machine learning models used in the prediction of survival of breast, glioblastoma, lung, renal cell, and oral cancer using genomic data are presented in Table 1. Machine learning algorithms predict cancer patients' survival very efficiently but need time-consuming and complex pre-processing techniques. The problems faced by machine learning models can be minimized with the use of deep learning techniques.

Table 1 Summary of machine learning methods for survival prediction of cancer patients using genomic data

2.5 Deep learning approaches for survival prediction of cancer

Deep learning is an advanced method of machine learning for the processing of complex data. The process is regarded as deep because it comprises multiple hidden layers, where the output of one layer is passed to the next. In contrast to traditional machine learning, deep learning algorithms typically do not require prior feature selection or extensive data pre-processing (although some pre-processing may be necessary). Instead, they employ either supervised or unsupervised training with multiple layers. In the existing literature, review papers such as [74] survey deep learning models based on multi-omics data, their clinical implications, and the challenges of using multi-omics data.

A deep learning method named Multimodal Deep Neural Network [75] was applied for survival prediction by integrating clinical data, copy number alteration, and gene expression. Three deep neural networks were constructed by taking into account the different data types to create a multimodal network. An Attention-based MultiNonnegative Matrix Factorization (AMND) algorithm was designed by integrating gene expression and clinical data. Nonnegative Matrix Factorization algorithms [75] were used to compute eigenvector weights to extract useful information from gene expression and clinical data. The summed eigenvector weights were then concatenated with clinical data and fed into deep neural networks for classification. A deep learning-based concatenation autoencoder (ConcatAE) was developed to integrate features of different datasets, together with a cross-modality autoencoder (CrossAE), for predicting overall survival time [77]. A new autoencoder-based feature extraction method named DeepSGP [78] was presented for glioblastoma patients; it stratified patients into different survival groups with an accuracy of 0.83 and predicted overall survival with an accuracy of 0.89. Another artificial intelligence-based approach [79] used the SVM model on radiomic features and a Cox-PH regression model with radiomics signature, clinical, and genomics data to categorize patients into different risk groups, giving a c-index of 0.75 with the combined dataset and 0.65 with clinical data only. In a different deep learning model, deep orthogonal fusion [80] used multimodal data to predict the overall survival of glioblastoma patients with a c-index of 0.788. DeepSurv with multi-omics data [81] for oral cancer predicted survival time with a c-index of 0.94. A deep learning approach was used to analyze tumor-infiltrating lymphocyte (TIL) profiles to identify their association with survival.
Also, 16 out of 22 TILs differed between the predicted risk groups [82].

The study [83] utilized machine learning and deep learning techniques to identify prognostic biomarkers for predicting the time-to-development of oral cancer and stratifying survival among patients with premalignant lesions. An autoencoder deep learning neural network extracted features, which were further analyzed using a univariate Cox regression model. Supervised clustering based on encoded features distinguished high-risk and low-risk groups, while a random forest classifier identified gene profiles associated with oral cancer subtypes. Another study [84] introduced a hybrid deep learning model with clinical, gene expression, and copy number alteration data for breast cancer prediction and survival prediction of patients. A novel predictive model [85] using a graph convolutional network (GCN) and Choquet fuzzy ensemble, integrating multi-omics and clinical data, was introduced. The model achieved competitive performance metrics, including an accuracy of 0.820 and a balanced accuracy of 0.769, outperforming baseline models and demonstrating its efficacy in prognostic classification. A novel prognostic algorithm [86] was developed by integrating pathogenomics and AI-based techniques. Machine learning and deep learning algorithms identified predictive features for survival outcomes; the multimodal model outperformed the unimodal ones, suggesting potential for personalized treatment strategies in oral cancer. A study introduced a Deep Convolution Cascade Attention Fusion Network (DCCAFN) for predicting lung cancer patients' survival based on imaging genomics [87]. The DCCAFN demonstrated effectiveness in multimodal data fusion, aiding physicians in risk stratification and personalized treatment decisions to improve patients' quality of life.

These studies show the potential of deep learning in enhancing survival prediction accuracy across various cancer types. Table 2 shows the deep learning algorithms for survival prediction of breast, glioblastoma, lung, renal cell, and oral cancer using genomic data.

Table 2 Summary of deep learning methods for survival prediction of cancer patients using genomic data

The comparison of various machine and deep learning algorithms for survival prediction in terms of accuracy evaluation parameters is shown in Table 3, which can help researchers select the best algorithm for survival prediction.

Table 3 Comparison of various machine and deep learning methods for survival prediction of cancer patients using genomic data in terms of accuracy

2.6 Feature selection methods for survival prediction of cancer

Dimensionality reduction techniques can help reduce the number of features in a dataset by identifying a smaller set of representative features that capture the most important information. However, even after dimensionality reduction, there may still be redundant or irrelevant features in the remaining set. Researchers have used feature selection methods to address this issue by identifying and selecting only the most relevant features for a particular task. The following methods are commonly used by researchers for dimensionality reduction or feature selection.

  1. I.

    Factor Analysis:- Factor analysis is a dimensionality reduction technique that simplifies complex data sets by identifying underlying factors or dimensions that explain the patterns and relationships in the data. It identifies key factors that contribute to the variance in the data [88].

  2. II.

    Principal Component Analysis (PCA):-PCA is a widely used dimensionality reduction technique that identifies the key features or components that explain the variance in a dataset. It works by transforming the original variables into a new set of uncorrelated variables called principal components, which capture the most important information in the data [89].

  3. III.

    Sparse PCA:-Sparse PCA is a variant of PCA that produces sparse solutions by promoting sparsity in the loadings of the principal components. This means that it identifies a smaller number of key features or components that contribute most to the variance in the data while setting the remaining loadings to zero [90].

  4. IV.

    Kernel PCA:-Kernel PCA is a nonlinear dimensionality reduction technique that extends the linear PCA to handle nonlinear relationships in the data. It works by projecting the data into a high-dimensional feature space using a nonlinear kernel function and then applying PCA to the resulting kernel matrix. This allows it to capture nonlinear variations in the data and identify the key components that explain the variance in the feature space [91].

  5. V.

    LASSO:-LASSO (Least Absolute Shrinkage and Selection Operator) is a dimensionality reduction technique that selects a subset of relevant features by imposing a penalty on the absolute values of the regression coefficients. This encourages sparsity in the model and effectively sets some of the coefficients to zero, leading to a simpler and more interpretable model [92].

  6. VI.

    Autoencoder:-Autoencoder is a neural network architecture that can be used for unsupervised dimensionality reduction. It works by encoding the input data into a lower-dimensional representation, also known as a latent space, and then decoding it back to the original dimensions. The encoder and decoder are trained together to minimize the reconstruction error between the input and output data. By constraining the size of the latent space, the autoencoder can effectively reduce the dimensionality of the input data, while preserving its essential features [93].

  7. VII.

    Fselector:- In Fselector feature selection, the F-test is used to measure the dependence between each feature and the target variable. The F-test calculates a score for each feature, which represents the degree of correlation between the feature and the target variable [94].

  8. VIII.

    mRMR:- The mRMR algorithm selects features by maximizing the relevance criterion and minimizing the redundancy criterion. It first selects the feature with the highest relevance and then selects additional features that have high relevance but low redundancy with the previously selected features. This process continues until the desired number of features is selected [95].

  9. IX.

    Non-negative Matrix Factorization (NMF):- The NMF algorithm decomposes a given matrix X into two non-negative matrices W and H, where W represents the set of basis vectors (latent features) and H represents the set of coefficients (weights) that combine these basis vectors to approximate the original matrix X. The NMF algorithm seeks to find the best values for W and H such that their product approximates the original matrix X [96].

  10. X.

    Log-rank test:- Log-rank-based feature selection works by dividing the population into two or more groups based on the values of a given feature. It then estimates the survival function for each group and compares the resulting curves using the log-rank test or a related test such as the Wilcoxon test. The p-value obtained from the test indicates whether the survival curves are significantly different or not. If the p-value is below a chosen threshold (e.g., 0.05), it suggests that the feature is an important predictor of survival [97].

  11. XI.

    Minimal Depth:- The algorithm works by constructing a decision tree using all available features and calculating the minimal depth of each feature. The features with the smallest minimal depth are considered the most important, as they appear closer to the root of the decision tree and have a greater influence on the final decision [98].

  12. XII.

    Linear Correlation:- The algorithm works by calculating the Pearson correlation coefficient between each feature and the target variable. The Pearson correlation coefficient measures the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation. Features with a high absolute value of the Pearson correlation coefficient are considered strongly correlated with the target variable and are selected for further analysis [99].

  13. XIII.

    Transfer Learning:- Transfer learning can be used as a feature selection approach in machine learning that identifies the most relevant features for a target task by leveraging knowledge from a related source task. The algorithm works by first training a model on a related source task using a large set of features. The trained model is then used to extract features from the source task that are relevant to the target task. These extracted features are then used as the input for a model trained on the target task [100].

  14. XIV.

    Scree Plot:- Scree plot feature selection is a graphical method used in principal component analysis (PCA) to identify the most important principal components (PCs) and, consequently, the most relevant features in a dataset. The number of principal components to retain is chosen at the elbow of the scree plot, and the retained PCs and their loadings (weights) are then used as the most important features in the dataset [101].
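To make the filter-style criteria in the list above more concrete, the following minimal Python sketch illustrates a greedy mRMR-style selection. As a simplifying assumption, both relevance and redundancy are measured here with the absolute Pearson correlation coefficient, whereas mRMR is commonly formulated with mutual information. The function names `pearson` and `mrmr_select`, the gene names, and the toy values are hypothetical and are not taken from any of the reviewed studies.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sd_x * sd_y)


def mrmr_select(features, target, k):
    """Greedily pick k feature names, trading relevance against redundancy.

    features: dict mapping feature name -> list of values (one per sample)
    target:   list of target values (one per sample)
    """
    # Relevance: absolute correlation of each feature with the target.
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Redundancy: mean absolute correlation with already selected features.
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): gene_b is a near-duplicate of gene_a,
# while gene_c carries independent but weaker information.
features = {
    "gene_a": [1, 2, 3, 4, 1, 2, 3, 4],
    "gene_b": [1.1, 2.0, 3.1, 4.0, 0.9, 2.0, 2.9, 4.0],
    "gene_c": [0, 0, 0, 0, 1, 1, 1, 1],
}
target = [2, 4, 6, 8, 3, 5, 7, 9]  # equals 2 * gene_a + gene_c

by_relevance = sorted(features, key=lambda n: -abs(pearson(features[n], target)))[:2]
by_mrmr = mrmr_select(features, target, 2)
print(by_relevance)  # ['gene_a', 'gene_b']
print(by_mrmr)       # ['gene_a', 'gene_c']
```

In this toy case, ranking by linear correlation alone keeps the two near-duplicate features, whereas the redundancy term steers the second choice toward the feature that adds new information.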

The comparison of various machine learning models based on the dimensionality reduction or feature selection method is shown in Table 4. Despite the advancements in feature selection, challenges remain in identifying the most informative features for survival prediction. The choice of an appropriate feature selection method depends on the specific dataset and cancer type, which represents an ongoing research gap in the field.
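As one illustration of a survival-specific screening criterion, the two-group log-rank comparison described earlier in this section can be sketched as follows. This is a minimal sketch under stated assumptions: patients are split into exactly two groups (for example, high versus low expression of one gene), the p-value uses the chi-square distribution with one degree of freedom, and the function name `logrank_test` and the toy follow-up times are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*:  follow-up time per patient
    events_*: 1 if the event (death) was observed, 0 if censored
    """
    # Pool the observations as (time, event, group), with group 0 = A and 1 = B.
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        # Expected events in A if both groups shared the same hazard.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (hypothetical): the "high expression" group dies earlier.
high_times, high_events = [2, 3, 4, 5, 6], [1, 1, 1, 1, 1]
low_times, low_events = [8, 9, 10, 11, 12], [1, 1, 1, 1, 0]

chi2, p_value = logrank_test(high_times, high_events, low_times, low_events)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A feature would be retained when the p-value falls below the chosen threshold. When many genes are screened this way, a multiple-testing correction is usually advisable, which this sketch does not include.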

Table 4 Comparison of models based on feature selection and dimensionality reduction methods

Overall, machine learning and deep learning approaches hold promise for enhancing cancer survival prediction. Addressing research gaps related to feature selection, algorithm selection, and model interpretability is essential for advancing the field and translating findings into clinical practice.

3 Discussion

Machine learning techniques have shown promising results in improving the accuracy of survival time prediction for cancer patients. By integrating multiple types of omics data, such as DNA methylation, copy number alteration, and mRNA expression, machine learning algorithms can identify patterns and relationships that may not be apparent through individual omics analyses. Numerous studies have investigated the identification of biomarkers for predicting the survival time of cancer patients using various machine learning algorithms. Various long non-coding RNAs (lncRNAs) [12, 16, 17, 18, 19, 20, 21], the Fanconi anemia pathway [22], transfer learning-based deep features [23], radiomics signature [24, 30], ten glucose metabolism risk signature [25], prognostic index, stem cell-related gene signature [26], seven CpG-based signature [27], 6-gene signature [28], aggregated signature based on ligand-gated channel pathways [29], TLS [31], APE1 polymorphism [32], TP53 [23, 35], COL4A5, ABCB1, NR3C2 and PLG [21], and the ESTIMATE machine learning algorithm [47] have been found to act as prognostic biomarkers. In contrast, 5-snoRNA signature [33], cancer-associated fibroblasts [34], and TP53 [35] have not shown promising results for survival prediction. Additionally, the effect of the tumor environment [36], age [37], and oral hygiene [38] on survival time prediction has been investigated. Some authors have identified various miRNA or mRNA genes [39, 40, 41, 42] and nomogram-based genes or miRNA signatures [40, 43, 44, 45, 46] that act as predictors for survival. Several studies have compared different machine learning methods, including 1-Nearest Neighbor (1NN), Naive Bayes (NB), Support Vector Machine (SVM), AdaBoost, Tree Random Forest (TRF), Radial Basis Function Network (RBFN), Multilayer Perceptron, Least-Squares SVM (LSSVM), Logistic Regression (LR), and Linear Discriminant Analysis (LDA) [52, 53]. The TRF and SVM models have been found to provide the best results in predicting survival time and metastasis of breast cancer [52, 53]. Some researchers have proposed new models for survival prediction using genomic data, including GPMKL based on Multiple Kernel Learning (MKL) [54], Multimodal AutoEncoders (MAE) [55], and Crystall [56]. These models have demonstrated improved accuracy and precision in predicting survival time for human breast cancer and lung squamous cell carcinoma.

A variety of deep learning-based methods have been applied to predict survival in cancer patients using multimodal data, including clinical data, copy number alteration, gene expression, and radiomic features. These methods include Multimodal Deep Neural [62], Attention-based MultiNonnegative Matrix Factorization (AMND) [62], ConcatAE [77], CrossAE [77], DeepSGP [66], SVM model [67], Cox-PH regression model [67], deep orthogonal fusion [68], DeepSurv [75], and TIL profiling [76]. These methods have demonstrated high accuracy in predicting overall survival, with c-index values ranging from 0.75 to 0.94.
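The c-index (concordance index) values quoted above measure how well a model's predicted risk scores order patients by survival time. A minimal sketch of Harrell's concordance index is given below. It assumes that a higher score means higher predicted risk, counts tied risk scores as half-concordant, and for simplicity skips pairs with tied survival times. The function name `concordance_index` and the toy values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times:       observed follow-up time per patient
    events:      1 if the event (death) was observed, 0 if censored
    risk_scores: model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before the follow-up time of patient j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death has higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tie in predicted risk
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): five patients, two of them censored.
times = [5, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.6, 0.3, 0.1]

c_index = concordance_index(times, events, risk_scores)
print(c_index)  # 0.875 (7 of 8 comparable pairs are concordant)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported range.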

To the best of our knowledge, this is the first review of the application of machine learning to survival prediction using genomic data for breast, glioblastoma, lung, renal cell, and oral cancer. Most of the reviewed studies used open-access databases. However, public databases have certain limitations; for example, the data are not updated at regular intervals. Therefore, research should focus on collecting data from different private or public hospitals by obtaining ethical consent from patients and hospitals.

Various feature selection methods used by researchers are Fselector [65], autoencoder [58, 78], mRMR [75], NMF [76], Log-rank test [61], minimal depth [65], DeepSGP [78], Transfer learning [23], Linear correlation [60] and Scree plot [75]. There is a need to compare the various feature selection methods in predicting the survival time of cancer patients. This would help researchers choose the best feature selection method.

The size of the datasets of cancer patients considered in the current study is between 100 and 2000, and genomic data has a large number of features that are difficult to process with machine learning algorithms. Therefore, there is a need to use appropriate feature selection or dimensionality reduction techniques to select important features. The datasets also contain outliers, noise, and missing values. In future studies, researchers should not only use appropriate machine learning methods but also consider various preprocessing and feature selection methods.

The most commonly used algorithms in this review are Cox regression, LASSO regression, random forest, and multiple kernel learning, and only a few studies used deep learning approaches. In future studies, deep learning approaches should be explored for the survival prediction of cancer patients using genomic data. Different techniques can be combined to produce the best results.

Several areas merit further investigation. Applying both early and late integration to multi-omics data and comparing the results could help identify the best integration approach for survival prediction. There is also scope to consider intra- and inter-interaction effects between the different data types of multi-omics data. Various feature selection techniques need to be compared to identify the most effective methods for predicting cancer survival, which would help researchers choose the best approach based on the specific characteristics of their dataset. While traditional machine learning methods such as Cox regression, LASSO regression, random forest, and multiple kernel learning have been commonly used, only a few studies have explored deep learning approaches, so there is scope for applying deep learning to survival prediction using multi-omics data. Finally, combining machine learning and deep learning techniques to achieve the best performance can be further explored.
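The distinction between early and late integration of multi-omics data can be illustrated with a short sketch. It assumes that each omics block is a mapping from patient ID to a feature vector, that early integration concatenates the vectors of patients present in every block, and that late integration averages per-block risk scores with equal weights. All names (`shared_patients`, `early_integration`, `late_integration`), the per-block scoring functions, and the toy values are hypothetical placeholders rather than methods from the reviewed studies.

```python
def shared_patients(omics_blocks):
    """Patient IDs that are present in every omics block."""
    id_sets = [set(block) for block in omics_blocks.values()]
    return sorted(set.intersection(*id_sets))


def early_integration(omics_blocks):
    """Concatenate the feature vectors of all blocks into one vector per patient.

    A single downstream model would then be trained on the fused vectors.
    """
    names = sorted(omics_blocks)  # fixed block order keeps columns aligned
    return {
        p: [value for name in names for value in omics_blocks[name][p]]
        for p in shared_patients(omics_blocks)
    }


def late_integration(omics_blocks, scorers):
    """Score each block with its own model, then average the risk scores."""
    fused = {}
    for p in shared_patients(omics_blocks):
        scores = [scorers[name](omics_blocks[name][p]) for name in sorted(omics_blocks)]
        fused[p] = sum(scores) / len(scores)
    return fused


# Toy data (hypothetical): P3 lacks methylation data and is therefore dropped.
omics = {
    "mrna": {"P1": [0.2, 1.5], "P2": [1.1, 0.3], "P3": [0.7, 0.9]},
    "methylation": {"P1": [0.8], "P2": [0.1]},
}
# Stand-ins for trained per-block models (assumed, for illustration only).
scorers = {
    "mrna": lambda x: sum(x) / len(x),
    "methylation": lambda x: x[0],
}

early = early_integration(omics)
late = late_integration(omics, scorers)
print(early)  # {'P1': [0.8, 0.2, 1.5], 'P2': [0.1, 1.1, 0.3]}
print(late)   # P1 is approximately 0.825, P2 approximately 0.4
```

Early integration lets one model learn interactions across data types at the cost of a larger feature space, while late integration keeps each model small but only combines information at the score level.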

4 Conclusion

This paper is a comprehensive review of the most recent machine learning-based approaches for predicting cancer patient survival, with a focus on the use of genomic data. The paper covers various cancer types, including breast cancer, glioblastoma, lung cancer, renal cell cancer, and oral cancer, and discusses the use of different machine learning techniques, such as random forests, support vector machines, neural networks, and deep learning algorithms. It also highlights the challenges involved in developing accurate survival prediction models, such as the need for large and standardized datasets with detailed genomic and clinical information. In addition, various dimensionality reduction and feature selection methods are compared, which can help improve the accuracy and generalizability of the models.

The key contribution of the research is to highlight the impact of machine and deep learning in the survival prediction of cancer patients. This review paper helps researchers explore the potential of an integrative approach to genomic data in survival prediction, which helps clinicians make informed decisions that further improve treatment outcomes.

Despite the advancements highlighted in this review, several limitations persist in the field of cancer survival prediction. One notable limitation is the limited access to standardized data, which may introduce biases in the prediction. Deep learning and machine learning have started a new revolution in the survival prediction of cancer patients, and there is still much scope for further improvement. Data from cancer patients come in different formats, e.g., miRNA, mRNA, copy number variation, and clinical data. With the advancement of new technologies, working with these data types has become easier, but there is still a need to explore various feature selection or dimensionality reduction techniques for handling the large number of features in genomic data. Addressing these limitations will be crucial for realizing the full potential of machine learning in survival prediction.

Future work should continue to focus on testing and improving algorithms and state-of-the-art models to improve survival prediction for cancer patients. Moreover, there is great scope to work with time-series data of cancer patients for better prognosis and to improve survival time. The impact of early and late integration of genomic data on survival prediction can also be explored further.