1 Introduction

In recent decades, remarkable advances in microarray technology have opened huge opportunities in genomic research, and especially in cancer research, to move from clinical decisions and standard medicine toward personalized medicine. The analysis of gene expression levels may reveal a great deal of information about the cancer type and its outcomes, and may make it possible to predict the best therapy in order to improve the survival rate.

Gene expression microarrays are a breakthrough technology developed in the late 1990s [1] that can simultaneously measure the expression levels of thousands of genes across different samples or experiments [2]. Many solution schemes for cancer classification and for therapy at the molecular and cellular levels may be derived from the analysis and comparison of the data generated by different experiments [3]. Microarray technology has two variants on the market [3]: (1) cDNA microarrays (spotted arrays) and (2) oligonucleotide microarrays (GeneChip). cDNA microarrays, developed at Stanford University, are cheaper and more flexible as custom-made arrays, while oligonucleotide arrays, developed at Affymetrix, are more automated, more stable, and easier to compare across experiments [3, 4]. The data produced by microarray technology form a matrix of expression values of thousands of genes for a few experiments; this matrix can be used to evaluate the variation of a gene across samples or the interaction of genes in different samples.

DNA microarray technology makes it possible to obtain the expression pattern of a huge number of genes quickly and simultaneously [5]. The resulting gene expression data are unique in nature for three reasons: (1) their high dimensionality (thousands of genes or more); (2) the small size of the publicly available data sets (a hundred samples or fewer); and (3) the irrelevance of a large proportion of the genes to cancer classification and analysis, where the problem is to find the difference between the gene expression of cancerous and non-cancerous tissues. For these reasons, researchers have proposed feature selection and/or dimensionality reduction as an important step for taking advantage of such data and converging toward accurate classifiers. Several machine learning methods have been used in cancer classification, and deep learning has recently started to be investigated as well because of its ability to work on raw, high-dimensional data.

This paper investigates the use of advanced machine learning to handle large-scale gene expression data in order to enhance cancer classification. It also explores the potential of deep learning based classifiers to manage such datasets. To this end, we propose a simple feed forward neural network and implement four classical yet powerful classifiers, namely support vector machine (SVM), k-nearest neighbours (KNN), Bayes naive (BN) and shallow neural network (SNN). We tested the four classifiers along with the deep classifier on five publicly available cancer datasets from the Gene Expression Omnibus (GEO) repository. The cancer types are leukemia, inflammatory breast cancer, lung cancer, bladder cancer and thyroid cancer.

The remainder of the paper is organized as follows. Section 2 highlights the classification methods used. Section 3 presents an overview of recent work related to machine learning and deep learning for gene expression and cancer classification. In Sect. 4 we explain our proposed deep feed forward neural network for the discussed problem. The datasets used are described in Sect. 5. Section 6 deals with the experimental study and presents the obtained results and our discussion. Finally, conclusions are drawn in Sect. 7.

2 Classification Methods

Many classification methods have been introduced through time. In the following we present four main methods.

2.1 K-Nearest Neighbours

The k-nearest neighbours (KNN) classifier is one of the simplest supervised classifiers. It determines the class membership of an unknown instance on the basis of the majority vote of its k nearest neighbours in the training dataset \(\{X\}\) [6]. KNN is a lazy, or instance-based, learning method: the function is approximated locally and all computation is postponed until classification [5]. When classifying a sample x, the KNN classifier finds the k examples in \(\{X\}\) that are most similar to x, where similarity is computed from the attributes of x and of each candidate example, and then assigns x the most frequent class label among these k examples. The simplest and most widely used similarity measure between two samples x and y is the geometric (Euclidean) distance [7].
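The voting procedure described above can be illustrated with a minimal sketch. It assumes the Euclidean distance as the similarity measure; the function names `euclidean` and `knn_predict` and the toy two-gene data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    # Sort the training samples by their distance to x and keep the k closest.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: hypothetical expression values of two genes per sample.
train_X = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]

prediction = knn_predict(train_X, train_y, [2.9, 3.1], k=3)
```

No model is fitted in advance: all the work happens inside `knn_predict`, which is what makes KNN a lazy learner. An odd k avoids tied votes in the two-class case.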

2.2 Support Vector Machine

Support vector machine (SVM) is also a supervised machine learning tool; it was introduced in 1995 [8] for pattern recognition. SVM has been widely used for both classification and regression tasks [9]. The concept of SVM is based on the following idea [8, 10, 11, 12]:

The instances \(\{X\}\) of the training data set are mapped into a high-dimensional feature space, where the task is to find the optimal hyperplane, that is, the one that maximises the margin between the classes. The training points that lie closest to this hyperplane are the support vectors (see Fig. 1).

Fig. 1. An SVM example representing the maximum margin between classes in two-dimensional space [8]
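To make the notion of margin concrete, the following sketch computes the geometric margin of two candidate separating hyperplanes on a toy two-dimensional data set. It only evaluates given hyperplanes and does not solve the SVM optimisation problem; the function name `geometric_margin`, the data and the two hyperplanes are assumptions made for illustration.

```python
import math


def geometric_margin(w, b, X, y):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    Labels in y must be +1 or -1. A positive result means that every
    sample lies on the correct side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, yi in zip(X, y)
    )


# Toy, linearly separable samples with labels -1 and +1.
X = [[1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 2.0]]
y = [-1, -1, 1, 1]

margin_a = geometric_margin([1.0, 1.0], -4.0, X, y)  # hyperplane x1 + x2 = 4
margin_b = geometric_margin([1.0, 0.0], -2.5, X, y)  # hyperplane x1 = 2.5
```

Both hyperplanes separate the two classes, but the first leaves a larger gap to the nearest samples, so a maximum-margin classifier would prefer it over the second.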

2.3 Naive Bayes Classifier

The naive Bayes (NB) classifier is likewise one of the earliest and simplest supervised machine learning methods. It is a probabilistic model that uses Bayes' formula to calculate the probability of class A given the values \(B_i\) of all attributes of an instance to be classified [13]. NB classifiers assume that all attributes of a given example are independent of each other given the class, which facilitates the learning phase, especially for large-scale data, because every parameter can be learned separately [14]. Naive Bayes classifiers have been used intensively in different fields, such as document classification [14], medical applications like EGG signal analysis [15], music emotion classification based on lyrics (text) analysis [13], and image classification [16].
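Written out, the independence assumption turns Bayes' formula into a product of per-attribute terms. The expression below uses the symbols A and \(B_i\) from the text; the number of attributes n and the predicted class \(\hat{A}\) are notation introduced here for illustration.

```latex
P(A \mid B_1, \dots, B_n)
  = \frac{P(A)\, P(B_1, \dots, B_n \mid A)}{P(B_1, \dots, B_n)}
  \;\propto\; P(A) \prod_{i=1}^{n} P(B_i \mid A),
\qquad
\hat{A} = \arg\max_{A} \; P(A) \prod_{i=1}^{n} P(B_i \mid A)
```

Because the denominator does not depend on A, it can be ignored when comparing classes, and each factor \(P(B_i \mid A)\) can be estimated from the training data independently of the others.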

2.4 Deep Learning

Deep learning (DL) is the new breakthrough in machine learning and artificial intelligence. DL moves machine learning from hand-designed features toward data-driven feature learning: it can learn complex models from simple features learned from raw data [17].

Deep neural networks (DNN) are the best showcase of deep learning: their multilayer structure makes it possible to explore hierarchical representations of data with increasing levels of abstraction [18]. These properties have allowed DNNs to demonstrate state-of-the-art performance in different domains [19, 20, 21].

Deep learning architectures include (1) deep neural networks (DNN), (2) convolutional neural networks (CNN) and (3) recurrent neural networks (RNN). A DNN is the simplest form of multilayer neural network; it may be a multilayer perceptron, an auto encoder (AE), a stacked auto encoder (SAE), a deep belief network (DBN) or a Boltzmann machine. CNNs are built upon three main types of layer: convolution layers, max-pooling layers and non-linear layers. At each convolutional layer, a group of local weighted sums, called feature maps, is obtained. At each pooling layer, maximum or average subsampling of non-overlapping regions in the feature maps is performed, which allows CNNs to identify more complex features [17, 18]. RNNs are designed to use sequential information and have a basic structure with cyclic connections. Past information is implicitly stored in hidden units called state vectors (long short-term memory, LSTM, variants additionally use an explicit memory), and the current output is computed from all previous inputs through this state vector [17].
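As a small illustration of the pooling step mentioned above, the sketch below applies max-pooling over non-overlapping 2 x 2 regions of a toy feature map. The function name `max_pool_2d` and the input values are hypothetical.

```python
def max_pool_2d(feature_map, size=2):
    """Max-pool a 2D feature map over non-overlapping size x size regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    # Step through the map in strides of `size` so that regions never overlap.
    for r in range(0, rows - size + 1, size):
        pooled.append([
            max(feature_map[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ])
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2d(feature_map)  # [[4, 2], [2, 8]]
```

Each output value summarises one region of the input, so the 4 x 4 map is reduced to 2 x 2 while the strongest response in each region is kept.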

3 Machine Learning in Gene Expression Cancer Analysis: Related Work

Both supervised and unsupervised methods have been used in gene expression data analysis. In 1998, a cluster analysis based on a graphical visualisation method was proposed in [22] to reveal correlated patterns between genes. Supervised machine learning has served microarray data analysis intensively and effectively [5]. Neural networks were proposed in [23] for cancer classification and diagnostic prediction. Li et al. [24] proposed a genetic algorithm/k-nearest neighbours approach to select genes that are highly discriminative in cancer sample classification: the set of genes is split into several subsets and the frequency of each gene's membership in the subsets is calculated. After a number of iterations, the genes with the highest frequency are the most relevant to the classification. This approach was used recently in [25] to select the most discriminative genes for classifying the TCGA data of 31 different cancer types. SVM has also been used in the field [10]. In [26], a new SVM ensemble based on AdaBoost (ADASVM) and consistency based feature selection (CBFS) was proposed for leukemia classification; SVM was used to overcome the problems of regular ensemble methods based on decision trees and neural networks, for which the authors cited the issue of tree size in the former and the overfitting problem in the latter. Another approach, based on the Bhattacharyya distance, was implemented in [27] for colon cancer and leukemia. The features were selected according to their ranking score, where the genes with a larger Bhattacharyya distance are the most effective in classification; the subset with the lowest classification error rate is then selected as the marker genes. In [28] a shallow neural network was proposed for colon cancer classification, with a variation in parameter setting that uses the Monte Carlo algorithm with SVM theory.

Recently, researchers have started to apply deep learning in this context [29]. Table 1 summarises the main recent studies in the literature, which we compare on the basis of the feature selection model used, the classification model and its accuracy.

Table 1. Recent research on deep learning for cancer classification. H/L: the highest and lowest accuracy scores of the classifier, depending on the dataset

Fakoor et al. [30] presented the use of deep learning for cancer classification through unsupervised feature learning. The proposed approach is a two-phase process. In the feature learning phase, principal component analysis (PCA) was used for dimensionality reduction. Since PCA yields a linear representation of the data, some raw features were added to capture the non-linearity of the features. Sparse auto encoders (stacked auto encoders in the second test) were then used for unsupervised feature learning. In the second phase, the set of learned features, together with some of the labelled data, was passed to the classifier for training; fine-tuning was also used to tune the weights of the features and to generalize the feature set so that it adapts to different cancer types.

Bhat et al. [31] used an adversarial model based on a convolutional neural network and a restricted Boltzmann machine for gene selection and classification of inflammatory breast cancer. The proposed generative adversarial network (GAN) is a combination of two networks. The first network is a generator that tries to mimic examples from the training data set (fake inputs) and feeds them, among the real inputs, to the second network. The latter works as a discriminator that tries to distinguish the true inputs from the false ones and to classify the samples as accurately as possible. The process continues until the discriminator can no longer distinguish the generated inputs from the real ones. The learnt features are passed to a sigmoid layer for supervised classification.

Danaee et al. [32] proposed stacked denoising auto encoders (SDAE) for breast cancer classification. The paper used SDAE to address the high dimensionality and noise of gene expression data and to select the most discriminative genes in breast cancer classification. The selected genes were evaluated with ANN and SVM.

In [33], a deep learning approach that combines five classical classification methods was proposed for the classification of lung cancer, stomach cancer and inflammatory breast cancer. The paper used DeSeq for feature selection; in the first classification stage, the selected features were passed through five classifiers, namely KNN, SVM, decision trees (DTs), random forest (RF) and GBDTs. The output of the first stage is used as the input to a five-layer neural network that classifies the samples.

4 Deep Forward Neural Network for Cancer Classification

The cancer classification problem tackled here can be formulated as follows. Given a matrix \(\{X\}\) of dimension \(N \times M\), where N is the number of samples and M is the number of genes, each \(x_{i,j}\) represents the expression level of gene j in sample i. Each sample is associated with a class, which is either cancerous or non-cancerous for binary classification, or the corresponding cancer subtype for multiclass classification.

The architecture is a multilayer feed forward neural network organized as the following:

  • The input layer receives the set of features that represent the gene expression values of each sample.

  • Seven hidden layers are used. Four are fully connected layers, and between them we added three dropout layers that apply a dropout penalty to avoid overfitting.

  • An output layer with a softmax classifier assigns the set of features received from the seventh hidden layer to the corresponding class.

  • We applied an L2 regularization, l2(), to the input data at the input layer level.

  • For the activation of layers we used the non-linear tanh and relu functions.

[Algorithm 1: pseudo-code of the proposed classifier]

The pseudo-code (Algorithm 1) outlines the steps for building our proposed classifier. We trained the network in batches with the Adam optimizer and a categorical cross-entropy loss. We also applied hold-out validation (70% training data, 30% testing data) to assess the performance of the classifier. The performance metrics used are the accuracy and the loss, where the objective is to maximize the accuracy and minimize the loss without falling into overfitting or underfitting.
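The architecture and training setup described in this section could be sketched in Keras roughly as follows. This is an illustration under stated assumptions, not the exact implementation: the layer widths, dropout rate, L2 weight, batch size, number of epochs and the assignment of tanh and relu to particular layers are all assumed, the L2 penalty is interpreted as a kernel regularizer on the first dense layer, and the data are synthetic. The helper name `build_dnn` is hypothetical.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers


def build_dnn(n_features, n_classes, units=(64, 32, 16, 8),
              dropout_rate=0.5, l2_weight=1e-4):
    """Four dense layers interleaved with three dropout layers, then softmax."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    # Assumed reading of "l2() at the input layer": penalise the first layer's weights.
    model.add(layers.Dense(units[0], activation="tanh",
                           kernel_regularizer=regularizers.l2(l2_weight)))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(units[1], activation="relu"))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(units[2], activation="relu"))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(units[3], activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Synthetic stand-in for a reduced gene expression matrix (60 samples, 100 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100)).astype("float32")
y = keras.utils.to_categorical(rng.integers(0, 2, size=60), num_classes=2)

# Hold-out validation: 70% of the samples for training, 30% for testing.
split = int(0.7 * len(X))
model = build_dnn(n_features=100, n_classes=2)
model.fit(X[:split], y[:split], batch_size=16, epochs=2, verbose=0)
loss, accuracy = model.evaluate(X[split:], y[split:], verbose=0)
```

Since the synthetic labels are random, the reported accuracy is meaningless here; the sketch only shows how the layers, optimizer, loss and hold-out split fit together.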

For dimensionality reduction we used three methods, namely kernel principal component analysis (KPCA) for non-linear problems, recursive feature elimination (RFE) and univariate feature selection (UFS). In this way we can evaluate the performance of the proposed classifier on different reduced data spaces.
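A possible way to set up the three reduction methods with scikit-learn is sketched below. The kernel, the number of retained features, the linear SVM used as the RFE estimator and the ANOVA F-score used for UFS are assumptions, as is the synthetic data; the text does not specify these settings.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix: 80 samples, 200 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))
y = rng.integers(0, 2, size=80)
X[y == 1, :5] += 2.0  # make the first five genes informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

n_keep = 10  # assumed size of the reduced feature space
reducers = {
    "KPCA": KernelPCA(n_components=n_keep, kernel="rbf"),
    "RFE": RFE(SVC(kernel="linear"), n_features_to_select=n_keep, step=0.1),
    "UFS": SelectKBest(score_func=f_classif, k=n_keep),
}

reduced = {}
for name, reducer in reducers.items():
    # Fit on the training split only; KPCA ignores the labels.
    train_r = reducer.fit(X_train, y_train).transform(X_train)
    reduced[name] = (train_r, reducer.transform(X_test))
```

Fitting each reducer on the training split alone is a design choice made in this sketch to keep information from the test samples out of the selected features. KPCA builds new features, whereas RFE and UFS keep a subset of the original genes.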

5 Datasets

The datasets (Table 2) are publicly available in the GEO repository (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi). They represent the expression levels of patient genes, which determine whether the samples are cancerous or non-cancerous as well as the type and stage of the disease. We applied preprocessing and imputation to some of the data sets in order to handle the values of genes that are missing in a few samples.

  • Leukemia Cancer (DS1): The data set is stored under the key GSE15061 [34]. It represents a case study of the transformation of myelodysplastic syndrome (MDS) into acute myeloid leukemia (AML). The samples are all bone marrow, distributed as 164 MDS patients, 202 AML patients and 69 non-leukemia samples. The total set is 870 samples with 54613 genes.

  • Inflammatory Breast Cancer (DS2): Stored under the key GSE45581 [35]. The samples are the expression of inflammatory breast cancer (IBC) tumor cells and non-IBC cells. The dataset is a total of 45 IBC and non-IBC samples with 40991 genes.

  • Lung Cancer (DS3): The dataset is stored under the key GSE2088 [36]. It comprises 48 samples of squamous cell carcinoma (SCC), 9 samples of adenocarcinoma and 30 normal lung samples. The total set is 87 samples with 40368 genes.

  • Bladder Cancer (DS4): The access key is GSE31189 [37]. The data set represents the gene expression of human urothelial cells and contains 52 samples from urothelial bladder cancer patients and 40 non-cancer samples. The set comprises 92 samples represented by 54675 genes.

  • Thyroid Cancer (DS5): Stored under the key GSE82208 [38], this data set has been used to differentiate between malignant and benign follicular tumours. It is a collection of 27 samples of follicular thyroid cancer (FTC) and 25 samples of follicular thyroid adenoma (FTA), with a dimensionality of 54675 genes.

Table 2. The data sets description (* preprocessed data set)

6 Results and Discussion

For the aforementioned classical machine learning models (SVM, BN, KNN) we used the models of the scikit-learn Python package; for the shallow and deep neural network architectures we used the Sequential model of the Keras package with a TensorFlow back-end.
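For orientation, the classical baselines could be set up with scikit-learn along the following lines. The specific estimator classes and hyperparameters (a linear-kernel `SVC`, `GaussianNB` as the naive Bayes variant, five neighbours for KNN) are assumptions, since the text only names the package, and the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic two-class data standing in for a reduced expression matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)
X[y == 1] += 1.5  # shift one class so that the problem is learnable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

classifiers = {
    "SVM": SVC(kernel="linear"),
    "BN": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Train each model on the 70% split and record its accuracy on the 30% hold-out.
accuracies = {
    name: clf.fit(X_train, y_train).score(X_test, y_test)
    for name, clf in classifiers.items()
}
```

All three estimators share the same `fit` and `score` interface, which makes it straightforward to compare them on the same hold-out split.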

The experimental results (Table 3) show how the classification accuracy varies with the classifier and the dimensionality reduction method. The obtained results demonstrate the usefulness of supervised machine learning in tumour classification. They also show that the deep classifier achieved better performance and a higher accuracy (up to 100% in several cases) than the classical models.

The proposed DNN model achieved the highest accuracy among the classifiers in many situations across the five datasets. For DS4, with the new feature space obtained by univariate feature selection, deep learning outperforms the other classifiers. For DS1, DS2 and DS3, the deep classifier achieved the highest accuracy score with both RFE and UFS, whereas for DS5 deep learning outperformed the other classifiers with all three dimensionality reduction methods.

Table 3. Comparative study results in terms of accuracy. Bold values represent the best obtained score.

Compared to SVM and shallow networks, the performance of BN and KNN was very promising as well. Both classifiers achieved the highest score in three out of five datasets. The Bayes naive classifier performed best with kernel principal component analysis and recursive feature elimination on DS2, DS3 and DS4, while KNN performed better with KPCA and UFS on DS1, DS3 and DS5. The overall performance of SVM and the shallow network was good, yet in the studied cases it was not good enough compared to the performance of the deep classifier.

For the cases where the proposed classifier did not achieve the best accuracy, we believe that an improved architecture (in its density, depth and parameter settings) and a better feature selection model would improve its performance. It is worth noting that the worst cases for the deep network (DS1, DS2, DS3 and DS4) were those where we used KPCA as the dimensionality reduction method. This leads us to assume that the new feature space was not discriminative enough to train the deep classifier to perform accurately.

7 Conclusion

In the era of information and massive datasets, classification and machine learning have for decades been intensively applied by computational, statistical and data analysis researchers to mine, organize and categorize huge data sets in order to extract valuable knowledge and acceptable patterns in a variety of fields.

Recently, with the advances in biological data generation and the migration of the biological and medical community toward personalized medicine and advanced cancer treatment systems, scientists have started to apply classification and machine learning in order to classify and extract biomarker genes that may help in the therapy process. Through this paper we have seen that machine learning has been widely used, from the first classical models to the new deep learning innovations. We therefore think it may be a key to new achievements in medical informatics. The experimental results and the theoretical research, mainly on the cancer classification problem, have also shown us that every classification model has its strengths and weaknesses, and that the variation in performance between classifiers, mainly the classical models, depends on the data and the experimental environment. We have also seen that deep learning is very effective and powerful in handling large-scale biological data sets, and that it was able to outperform other models in discrimination and classification accuracy. In our future contributions we will try to use deep models for the selection and identification of relevant biomarkers for cancer diagnosis and therapy.