Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture

Kishore, Akash; Venkataramana, Lokeswari; Prasad, D. Venkata Vara; Mohan, Akshaya; Jha, Bhavya

doi:10.1007/s11517-023-02892-1

Original Article
Published: 02 August 2023

Volume 61, pages 2895–2919, (2023)

Medical & Biological Engineering & Computing

Abstract

Prediction of the stage of cancer plays an important role in planning the course of treatment and has largely relied on imaging tools, which do not capture the molecular events that cause cancer progression. Gene-expression data–based analyses are able to identify these events, allowing RNA-sequence and microarray cancer data to be used for cancer analyses. Breast cancer is the most common cancer worldwide, and is classified into four stages: stages 1, 2, 3, and 4 [2]. While machine learning models have previously been explored for stage classification with limited success, multi-class stage classification has not seen significant progress. There is a need for improved multi-class classification models, for example through deep learning. Gene-expression-based cancer data is characterised by the small size of available datasets, class imbalance, and high dimensionality, so class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the necessary genes can improve classification accuracy. The breast cancer samples are to be classified into 4 classes of stages 1 to 4. Invasive ductal carcinoma breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored: synthetic minority oversampling technique (SMOTE), and SMOTE followed by random undersampling. A hybrid feature selection pipeline is proposed, with three pipelines explored involving combinations of filter and embedded feature selection methods: Pipeline 1 (minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS)), Pipeline 2 (mRMR, mutual information (MI), and CFS), and Pipeline 3 (mRMR and support vector machine–recursive feature elimination (SVM-RFE)).
The classification is done using deep learning models, namely a deep neural network, a convolutional neural network, a recurrent neural network, a modified deep neural network, and an AutoKeras-generated model. Classification performance after class balancing and feature selection shows marked improvement over classification prior to feature selection. The best multiclass classification was obtained by a deep neural network after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving a Cohen-Kappa score of 0.303 and a classification accuracy of 53.1%. For binary classification into early and late-stage cancer, the best performance was obtained by a modified deep neural network (DNN) after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving an accuracy of 81.0% and a Cohen-Kappa score (CKS) of 0.280. This pipeline also showed improved multiclass classification performance on neuroblastoma cancer data, with a best area under the receiver operating characteristic curve (auROC) score of 0.872, as compared to 0.71 obtained in previous work, an improvement of 22.81%. The results and analysis reveal that feature selection techniques play a vital role in gene-expression data-based classification, and that the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement, particularly in late-stage classification, is necessary.

1 Introduction

Cancer is a genetic disease and globally one of the leading causes of death. In 2020, 2.3 million women were diagnosed with breast cancer and 685,000 died [35]. Treatment planning and prognosis have to factor in the stage of cancer. Automating staging using artificial intelligence methods is of great research interest, as it would eliminate the need for assessments by professionals and would also ensure greater data collection from a single procedure. Cancer prediction models to date have depended on neural networks to uncover complex connections in the data. Gene expression profiling depicts how genotype gives rise to phenotype by measuring the transcribed mRNA count in a genomic system. The most popular standardised ways to measure gene expression variation are RNA sequencing and microarrays. RNA-Seq, or RNA sequencing, is a next-generation sequencing method that measures the presence and change in the RNA quantity in a sample at any given time [28]. Microarray-based gene expression profiles are widespread in cancer research for biomarker identification in the prediction of clinical endpoints such as diagnosis, prognosis, and treatment response. Microarrays measure the relative activity of previously identified target genes [10, 17]. RNA-Seq gives greater coverage and resolution of the changing nature of the transcriptome than microarray-based techniques [25].

PET and MR imaging techniques, although widely available for early breast cancer detection, rely on physical features, which do not provide insights into the molecular events that cause cancer progression [3]. Gene expression analysis, on the other hand, can capture early-stage indicators as well as the molecular events that mark the advance from early to late-stage disease. Gene-expression data can therefore be used to identify and classify the stages of cancer. By nature, gene-expression cancer data brings with it some challenges, including high dimensionality and class imbalance, so appropriate feature selection methods and class balancing techniques must be applied. While machine learning models for predicting the stage and type of cancer exist [36, 37], no attempts have been made to pre-process the gene expression data, apply deep learning models [8, 18], and determine the stage of cancer with high accuracy. Therefore, there is a need to explore multiclass classification of cancer stages using a hybrid feature selection technique and deep learning models.

2 Related work

A survey of neural-network-based cancer prediction models from microarray data [7] surveyed papers published between 2003 and 2018 on neural networks, gene-expression data, and cancer prediction, and covered cancer classification, discovery, survivability prediction, and statistical analysis models. Pre-processing methods covered included Affymetrix normalisation, fragments per kilobase per million (FPKM) normalisation, and zero-mean, unit-variance normalisation. Synthetic minority oversampling technique (SMOTE) and other oversampling techniques were used for class balancing. Deep MLP models, generative models, extreme learning machines, convolutional neural networks (CNN), and genetic algorithms were used for classification of cancer and for cancer survivability prediction.

The initial findings on the gene expression profile of peripheral blood mononuclear cells (PBMCs), which may contribute to the identification and immunological classification of breast cancer patients [30], imply that evaluating gene expression trends of PBMCs can be a less invasive diagnostic method and can give insights into breast cancer biomarkers.

In [22], the authors propose a novel machine learning method using transfer learning for reconstructing gene regulatory networks (GRNs) from gene expression data. The method leverages knowledge from a source organism’s GRN to reconstruct the GRN of a target organism. It performs well in positive-unlabelled settings and outperforms state-of-the-art approaches, identifying previously unknown functional relationships among genes in the human GRN.

The study of combined pN stage and breast cancer subtypes as a better discriminator of outcome, which can be used to refine the 8th AJCC staging manual [38], suggests that the combined pN stage and breast cancer subtypes can predict and discriminate between breast cancer outcomes.

The study of KRAS expression as a prognostic indicator associated with immune infiltration in breast cancer [20] concluded that KRAS expression can indicate breast cancer prognosis and is closely linked to tumour immune infiltration.

Microarray cancer feature selection: Review, challenges, and research directions [17] presents an extensive survey of studies on microarray cancer classification with a focus on feature selection methods. Filter, wrapper, embedded, and hybrid approaches to feature selection were covered. The filter techniques discussed are correlation-based feature selection, the fast correlation-based filter (FCBF) technique, the INTERACT algorithm, information gain, ReliefF, the mRMR algorithm, and the consistency-based filter. The wrapper techniques are ant colony optimization, distance sensitive rival penalised competitive learning–support vector machine (ADSRPCL-SVM), and genetic algorithm with SVM. The embedded technique is SVM-RFE. The hybrid approaches combine statistical and machine learning methods, such as ANOVA and LDA coupled with SVM, and filtering using mRMR followed by NB and SVM.

A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumour classification [29] described an effective hybrid gene selection method based on ReliefF [33] and ant colony optimisation (ACO) algorithm called RFACO-GS for tumour classification. It was tested on four datasets — colon cancer, leukaemia, lung cancer, and prostate cancer. The classification accuracy of RFACO-GS, 94.3%, was found to be highest out of the algorithms implemented.

The authors in the work [39] propose a model called laminar augmented cascading flexible neural forest (LACFNForest) for the classification of cancer subtypes. The model utilises a cascading flexible neural forest with a hierarchical broadening ensemble method and an output judgment mechanism to improve classification accuracy and reduce computational complexity. Experimental results on RNA-seq gene expression data demonstrate that LACFNForest outperforms conventional methods in cancer subtype classification, offering a promising approach for ensemble learning of classifiers with improved accuracy and robustness.

Identification of gene-expression signatures and protein markers for breast cancer grading and staging [36] described a computational method for predicting gene signatures for breast cancer stages based on RNA-seq data, using the TCGA [31] breast cancer dataset. The Wilcoxon signed-rank test was applied to identify genes that are differentially expressed in cancer versus control samples. The Spearman correlation coefficient was used to assess the correlation between the average gene expression and the sample stage, in order to identify genes whose expression goes up or down with stage. The Mann-Whitney test was then applied to identify the genes differentially expressed among the different stages, and pathway enrichment analysis was performed. The SVM-RFE approach was applied to predict gene signatures for each breast cancer grade as well as stage. A 30-gene panel and a 21-gene panel were predicted as gene signatures for distinguishing advanced-stage (stages 3–4) from early-stage (stages 1–2) cancer samples and for distinguishing stage 2 from stage 1 samples, respectively.

Classification models for invasive ductal carcinoma (IDC) progression, based on gene expression data-trained supervised machine learning [27], covered staging of IDC samples using machine learning algorithms. Samples with clinical stages 1 and 2 were pooled together as ‘early stage’, while stages 3 and 4 were pooled together as ‘late stage’. Near-zero-variance features and features with correlation coefficients of more than 80% were removed. The training datasets were standardised using z-score normalisation. Feature selection algorithms such as RFE, RLASSO, random forest, linear modelling, and linear regression were implemented. To obtain a consensus ranking, the overall mean of the ranks assigned to each feature by the individual methods was calculated. Subsequently, the top 20, 30, 40, 50, 60, and 80 features were used to train and evaluate models for binary classification of early vs late IDC, based on 5 machine-learning methods, namely RF, Naive Bayes, SVM, logistic regression, and decision tree. The feature list which gave the highest accuracy for all the machine-learning methods was selected for model generation and evaluation. For the complete gene expression-based model, the classification accuracy of the generated prediction models ranges from 74% for SVM to 95% for random forest, and the auROC value ranges from 0.76 for LR to 0.93 for the random forest model.
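The consensus ranking step described for [27] can be illustrated with a small sketch. The feature names and per-method ranks below are invented for illustration and are not taken from that study; only the idea of ordering features by their mean rank comes from the description above.

```python
from statistics import mean

def consensus_ranking(ranks_by_method):
    """Order features by their mean rank across methods (1 = best)."""
    features = next(iter(ranks_by_method.values())).keys()
    mean_rank = {
        f: mean(ranks[f] for ranks in ranks_by_method.values())
        for f in features
    }
    # Sort by mean rank; ties are broken alphabetically for determinism.
    return sorted(features, key=lambda f: (mean_rank[f], f))

# Hypothetical ranks produced by three feature selection methods.
ranks_by_method = {
    "RFE": {"geneA": 1, "geneB": 2, "geneC": 3, "geneD": 4},
    "random_forest": {"geneA": 2, "geneB": 1, "geneC": 4, "geneD": 3},
    "linear_model": {"geneA": 1, "geneB": 3, "geneC": 2, "geneD": 4},
}
top_k = consensus_ranking(ranks_by_method)[:2]  # keep the two best features
```

Training on the top 20, 30, 40, and so on features then amounts to slicing this ordered list at different lengths.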

In deep learning for stage prediction in neuroblastoma using gene expression data [23], the dataset used to build the model was obtained through the Gene Expression Omnibus (GEO) [4] and TCGA. The DNN Classifier on TensorFlow was used to classify the neuroblastoma dataset into 5 stages: 1, 2, 3, 4, and 4S. Stages 1 and 4 could be distinguished in neuroblastoma patients. Considering the poor prediction of the other stages in the test set, it is likely that overfitting occurred for stages 2, 3, and 4S owing to the small size of the dataset (280 cases). The accuracies on the training set and the test set were found to be 100% and 55.56%, respectively. The stage-wise (1, 2, 3, 4, and 4S) one-vs-rest (OVR) AUCs were 0.8, 0.66, 0.59, 0.85, and 0.58, respectively.
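To make the one-vs-rest AUC concrete, the following is a minimal sketch in plain Python. It computes AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score. The labels and predicted probabilities are toy values chosen for illustration, not data from [23].

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, y_true, classes):
    """Per-class AUC: class c is positive, every other class is negative."""
    result = {}
    for idx, c in enumerate(classes):
        scores = [row[idx] for row in prob_rows]
        labels = [1 if y == c else 0 for y in y_true]
        result[c] = binary_auc(scores, labels)
    return result

# Toy example with three stages and five samples (illustrative values only).
classes = ["1", "4", "4S"]
y_true = ["1", "1", "4", "4", "4S"]
prob_rows = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.20, 0.60, 0.20],
    [0.30, 0.40, 0.30],
    [0.50, 0.25, 0.25],
]
auc = one_vs_rest_auc(prob_rows, y_true, classes)
```

A class with few samples, such as the single stage 4S sample here, contributes very few pairs, which is one reason per-stage AUCs on small datasets are unstable.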

A novel machine learning method was proposed that exploits knowledge about the gene regulatory networks (GRNs) derived from gene expression data of a source organism to reconstruct the GRN of a target organism, by means of a novel transfer learning technique. The proposed method outperforms state-of-the-art approaches and identifies previously unknown functional relationships among the analysed genes [22]. A laminar augmented cascading flexible neural forest (LACFNForest) model was proposed for the classification of cancer subtypes. This model is a cascading flexible neural forest using the deep flexible neural forest (DFNForest) as the base classifier. A hierarchical broadening ensemble method was proposed, which ensures the robustness of classification results and avoids wasting model structure and function as much as possible. The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach to the ensemble learning of classifiers in terms of structural design [39].

The inference of gene regulatory networks (GRNs) is of great importance for understanding the complex regulatory mechanisms within cells. The emergence of single-cell RNA-sequencing (scRNA-seq) technologies enables the measurement of gene expression levels for individual cells, which promotes the reconstruction of GRNs at single-cell resolution. The authors of [21] proposed a multi-view contrastive learning (DeepMCL) model to infer GRNs from scRNA-seq data collected from multiple data sources or time points. An attention mechanism is introduced to integrate the embeddings extracted from different data sources and different neighbour gene pairs.

In gene expression classification based on deep learning [2], gene expression data of 4 types of cancer were used: diffuse large B cell lymphoma, prostate cancer, leukaemia, and colon cancer. Min–max normalisation technique was applied. Four deep learning models were applied on the cancer classification task, and the results were compared. The models used were deep neural network, recurrent neural network (RNN), convolutional neural network, and modified DNN: DNN in combination with dropout. The performance of the models was evaluated using the accuracy measure. It was found that the modified DNN model performed best across the datasets.

In integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling [5], heterogeneous datasets of microarray data and RNA-Seq data were integrated to identify gene expressions and classify genes as possible biomarkers for breast cancer. Overall, classification models tend to perform poorly on minority classes and usually overfit during training, leading to misleadingly high accuracy. Gene expression data is characterised by high dimensionality, and selecting the most important features from this data reduces computational cost. Hence, a hybrid model that applies deep learning to gene-expression data, in order to gain insight into and improve the results of cancer staging prediction, is proposed in the following sections.

3 Materials and methods

3.1 Proposed methodology

The proposed methodology for multiclass classification of cancer stages has been detailed in Fig. 1.

3.2 Data extraction and pre-processing

Gene expression data is extremely high dimensional by nature. Often, the number of samples is on the order of tens or hundreds, while the number of features is close to 20,000. This poses serious computational challenges. Efficient methods that can capture the required information from a select group of features, without compromising on classification performance or on computational and time requirements, are crucial. Gene expression data is mostly of 2 types, RNA-Seq and microarray. Since study reproducibility and data-model sensitivity are known issues in medical datasets, the two chosen datasets from TCGA (RNA-Seq) and METABRIC (microarray) were combined. This is extremely important because most earlier studies used microarray data, while more recent studies use RNA-Seq as it continues to rise in popularity. The model, if trained properly, could then accept a sample of either type as input and classify it into the correct stage [32]. The steps involved in combining the datasets are detailed in Fig. 2.
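Because Fig. 2 is not reproduced here, the following is only a minimal sketch of one plausible way to combine an RNA-Seq and a microarray dataset, not the authors' exact procedure. It assumes that the genes common to both platforms are kept and that each dataset is z-score standardised per gene before the samples are stacked. The gene symbols and expression values are toy inputs, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardise each column (gene) to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant genes
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def combine_on_shared_genes(genes_a, rows_a, genes_b, rows_b):
    """Keep genes present in both datasets, scale each dataset separately, then stack."""
    shared = sorted(set(genes_a) & set(genes_b))

    def select(genes, rows):
        idx = [genes.index(g) for g in shared]
        return [[row[i] for i in idx] for row in rows]

    scaled_a = zscore_columns(select(genes_a, rows_a))
    scaled_b = zscore_columns(select(genes_b, rows_b))
    return shared, scaled_a + scaled_b

# Toy inputs: rows are samples, columns follow the gene lists (values are invented).
rnaseq_genes = ["TP53", "BRCA1", "ESR1"]
rnaseq_rows = [[10.0, 200.0, 5.0], [30.0, 400.0, 15.0]]
microarray_genes = ["ESR1", "TP53", "MKI67"]
microarray_rows = [[7.0, 8.0, 6.0], [9.0, 12.0, 6.5]]

shared_genes, combined = combine_on_shared_genes(
    rnaseq_genes, rnaseq_rows, microarray_genes, microarray_rows
)
```

Scaling each platform separately puts the very different raw ranges of RNA-Seq and microarray measurements on a comparable footing before they are merged; other harmonisation methods would be equally compatible with the description above.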

3.3 Hybrid feature selection

High dimensionality and class imbalance were two other issues identified in current work. As such, a good feature selection method is crucial to the success of the cancer staging task. Since each method has its own strengths and weaknesses, a combination of different types of feature selection methods might prove fruitful by utilising the advantages of each. It also adds a level of confidence, since the selected features reflect the consensus of the selected methods. Therefore, experiments were conducted to identify the optimal feature selection pipeline for a deep learning model. Based on available literature, possible choices and combinations of filter and embedded feature selection methods were selected. Three pipelines were built for multiclass classification, and two of those three were implemented [14, 16, 24] for binary classification. The filter methods used are mRMR [9], CFS, and MI. SVM-RFE, an embedded technique, is also used. The three pipelines are described in Fig. 3.

The three methods were chosen for their performance on gene expression datasets in other works. mRMR has been shown to be successful in selecting features and hence was chosen for all three pipelines. To identify which combination performs best, the other methods were varied across the pipelines.
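Since mRMR is the first stage of every pipeline, a small sketch of its greedy selection rule may help. This is an illustrative plain-Python version, not the implementation used in the study. It assumes expression values have already been discretised into a few levels, and it uses the difference form of the criterion (relevance minus mean redundancy), which is one common variant. The feature names and values are toy data.

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def mrmr(features, target, k):
    """Greedy mRMR: at each step pick the feature maximising
    relevance MI(f, target) minus mean redundancy MI(f, already selected)."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        candidates = [n for n in features if n not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data: g2 duplicates g1 (redundant), g3 is weakly relevant, g4 is noise.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 1, 0, 0, 1, 1, 1],
    "g4": [0, 1, 0, 1, 0, 1, 0, 1],
}
chosen = mrmr(features, target, k=2)
```

In this toy case the duplicate `g2` is skipped in favour of `g3`, even though `g2` is individually more relevant. Removing such redundancy before a second selector (CFS, MI, or SVM-RFE) is applied is the role mRMR plays in a hybrid pipeline.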

3.4 Deep learning models for classifying cancer stages

Finally, the choice of deep learning model was also made experimentally. Most previous works used machine learning methods, and only a handful used deep learning algorithms.

Therefore, combinations of feature selection and deep learning methods were executed to find the optimal combination. The deep architectures selected were a deep neural network, a convolutional neural network, a deep neural network with dropout, a recurrent neural network, and an AutoKeras-generated model. The DNN with dropout was chosen specifically because the model could overfit the dataset owing to class imbalance. AutoKeras is a tool that identifies the optimal model architecture for a given dataset. Since it aligned with the objective of finding the best deep learning model [11], AutoKeras was used to identify other possible architectures that may perform well. The deep learning classification models were constructed using the TensorFlow framework [1].

3.4.1 Deep neural network

The deep neural network model used consisted of three dense layers, with the activation function ReLU (Fig. 4).
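As Fig. 4 is not reproduced here, the following plain-Python sketch shows only the forward pass of a network of this general shape. The layer widths, the number of input genes, the softmax output over four stages, and the random initialisation are all assumptions made for illustration. The study itself built its models in TensorFlow, and training is omitted.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, hidden_layers, output_layer):
    """Apply each dense + ReLU layer in turn, then a softmax output layer."""
    for weights, biases in hidden_layers:
        x = relu(dense(x, weights, biases))
    return softmax(dense(x, *output_layer))

rng = random.Random(0)
n_genes, n_classes = 20, 4      # assumed sizes, not taken from the paper
sizes = [n_genes, 16, 8, 8]     # three dense ReLU layers with assumed widths
hidden_layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output_layer = init_layer(sizes[-1], n_classes, rng)

sample = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
stage_probs = forward(sample, hidden_layers, output_layer)  # one probability per stage
```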

3.4.2 Convolutional neural network

The convolutional neural network model used consisted of a 1D convolution layer and a max pooling operation, followed by flattening of the pooled output (Fig. 5).
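The three operations named above can be traced on a toy expression vector. The sketch below is a plain-Python illustration with assumed kernel values and pool size; it is not the configuration shown in Fig. 5.

```python
def conv1d(signal, kernels, bias=0.0):
    """'Valid' 1D convolution (cross-correlation, as in most deep learning
    libraries) of a single-channel signal; returns one feature map per kernel."""
    maps = []
    for kernel in kernels:
        k = len(kernel)
        maps.append([
            sum(w * signal[i + j] for j, w in enumerate(kernel)) + bias
            for i in range(len(signal) - k + 1)
        ])
    return maps

def max_pool1d(feature_map, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(feature_map[i:i + pool_size])
        for i in range(0, len(feature_map) - pool_size + 1, pool_size)
    ]

def flatten(maps):
    """Concatenate all feature maps into one vector for the dense layers."""
    return [v for m in maps for v in m]

expression = [1.0, 3.0, 2.0, 5.0, 4.0]   # toy expression values
kernels = [[1.0, -1.0], [0.5, 0.5]]      # assumed filters: a difference and an average
feature_maps = conv1d(expression, kernels)
flat = flatten([max_pool1d(m) for m in feature_maps])
```

Here each kernel slides over neighbouring genes, so the result depends on the order in which genes are arranged in the input vector.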

3.4.3 Modified DNN − DNN + dropout model

This model is a modified version of the DNN discussed previously, with each dense layer followed by a dropout layer. The modified DNN model is shown in Fig. 6.
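What a dropout layer does to the output of a dense layer can be shown in a few lines. The sketch below uses the common inverted-dropout formulation; the dropout rate of 0.5 and the activation values are assumptions for illustration, as the rate used in the study is not stated here.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training
    and scale survivors by 1 / (1 - rate); identity at inference time."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
hidden = [0.2, 1.5, 0.0, 3.1]                       # toy dense-layer outputs
dropped = dropout(hidden, rate=0.5, rng=rng)        # training-time behaviour
unchanged = dropout(hidden, rate=0.5, rng=rng, training=False)
```

Randomly silencing units during training discourages the network from relying on any single gene or hidden unit, which is why dropout is a natural guard against overfitting on small, imbalanced datasets.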

3.4.4 Recurrent neural network

This model made use of the simple RNN layer, a fully connected RNN in which the output of each step is fed back as input to the next step (Fig. 7).
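The feedback described above corresponds to the recurrence h_t = tanh(W_in x_t + W_rec h_{t-1} + b), which is the default behaviour of a simple RNN layer in Keras. The plain-Python sketch below runs that recurrence on a toy sequence. How the expression vector was arranged into time steps is not stated here, so the input is treated as a generic sequence and all weights are invented.

```python
import math

def simple_rnn(inputs, w_in, w_rec, bias):
    """Run h_t = tanh(W_in x_t + W_rec h_{t-1} + b) over a sequence and
    return the final hidden state (the initial state is all zeros)."""
    h = [0.0] * len(bias)
    for x_t in inputs:
        h = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[i], x_t))
                + sum(w * v for w, v in zip(w_rec[i], h))
                + bias[i]
            )
            for i in range(len(bias))
        ]
    return h

# One hidden unit and a two-step sequence of scalar inputs (toy values).
final_state = simple_rnn(
    inputs=[[1.0], [0.0]],
    w_in=[[1.0]],
    w_rec=[[1.0]],
    bias=[0.0],
)
```

The second step receives a zero input, so its output depends only on the state carried over from the first step, which makes the feedback path visible.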

3.4.5 AutoKeras

AutoKeras [19] is a publicly available library designed to facilitate automated machine learning (AutoML) processes specifically tailored for deep learning models. It leverages Keras models, implemented through the TensorFlow tf.keras API, to conduct the search.

3.5 Dataset

Gene expression data for invasive ductal carcinoma was extracted from 2 publicly available sources, TCGA (RNA-Seq) and METABRIC (microarray) (Tables 1 and 2).

Table 1 Class-wise distribution of combined dataset
