
1 Introduction

In recent decades, cancer has become a major public health issue worldwide. According to the World Health Organization (WHO), cancer incidence rose to 18.1 million new cases and 9.6 million deaths in 2018. Consequently, many studies in recent years have sought effective ways to diagnose and treat this disease. However, many challenges remain in cancer treatment because possible causes of cancer are genetic disorders or epigenetic alterations in the somatic cells [1]. Moreover, cancer can be viewed as a disease of altered gene expression: many proteins are turned on or off, dramatically changing the basic activity of the cell. Microarray technology enables researchers to investigate questions once thought impractical by simultaneously measuring the expression levels of thousands of genes in a single experiment [2]. Gene expression profiles may be used to detect and diagnose diseases or to see how well the body responds to treatment, so many algorithms have been developed to analyse gene expression data. During the past decade, many classification algorithms have been applied to gene expression data, including support vector machines (SVM) in [3], neural networks in [4], k nearest neighbors (kNN) in [5], C4.5 decision trees (C4.5) in [6], random forests (RF) in [7], decision-tree-based bagging and boosting style algorithms in [8], and bagging of oblique decision stumps (Bag-RODS) and boosting of oblique decision stumps (Boost-RODS) in [9].

Although many classification algorithms for gene expression data have emerged in recent years, there remains a critical need to improve classification accuracy. Most state-of-the-art classification algorithms face two main research challenges when dealing with gene expression data: very high dimensionality and small sample size. A characteristic of microarray gene expression data is that the number of variables (genes) n far exceeds the number of samples m, commonly known as the “curse of dimensionality” problem. These issues lead to statistical and analytical challenges, and conventional statistical methods give improper results due to the high dimensionality of gene expression data with a limited number of patterns [10]. In practice, it is often infeasible to build a machine learning model on extremely large feature sets with millions of features because of the high computing cost. In addition, because the training sample size is relatively small compared to the feature vector size, classification models may give poor performance due to over-fitting.

In order to address these issues, dimensionality reduction and data enhancement methods are often used. Recently, the very-high-dimensionality problem has been tackled by feature extraction with deep convolutional neural networks (DCNN) [11,12,13,14]. The main advantage of such networks is their ability to extract new latent features from the gene expression data, which are then passed to the classifiers. To tackle the small-sample-size problem, many studies have used data enhancement methods to improve classification accuracy. SMOTE [15] is a very popular over-sampling method that generates new samples by interpolation within the minority class. A benefit of SMOTE is that it generates new data from the original data and can therefore be used to enlarge the training sample. However, this algorithm is usually applied to low-dimensional data [16], because it seems beneficial but less effective for very-high-dimensional data. The cause of this problem is that the interpolation process relies on the k nearest neighbors algorithm, so it can suffer from over-fitting on very-high-dimensional data. In practice, this over-sampling algorithm should not be used with kNN without variable selection, because it strongly biases the classification towards the minority class [17]. It is therefore often combined with data preprocessing methods such as feature selection or feature extraction.

In this paper, we propose new learning algorithms for the precise classification of gene expression data that combine non-linear support vector machines (SVM), linear SVM (LSVM), kNN, and RF with SMOTE, using features extracted by a DCNN (called DCNN-SMOTE-[SVM, LSVM, kNN, RF]). The algorithms perform the training task in three main steps. First, we use a new DCNN model to extract new features from the gene expression data. The new features improve the dissimilarity power of the gene expression representations and thus yield a higher accuracy rate than the original features. Second, we apply the SMOTE algorithm to enhance the gene expression data using the features extracted by the extraction model. Finally, these two algorithms are used in conjunction with the classifiers to classify gene expression data efficiently. Results on 50 low-sample-size and very-high-dimensional microarray gene expression datasets from the Kent Ridge Biomedical [18] and Array Express repositories [19] illustrate that the proposed DCNN-SMOTE-SVM is more accurate than the state-of-the-art classification models, including SVM [20], LSVM [21], kNN [22], RF [23], and C4.5 [24, 25]. In addition, DCNN and SMOTE also improve the accuracy of the linear SVM, RF, and kNN classifiers.

The paper is organized as follows. Section 2 discusses related work. Section 3 gives a brief overview of DCNN, SMOTE, and our proposal. Section 4 presents the experimental results, and the conclusions are given in the final section.

2 Related Works

Our proposal relates in several respects to existing classification approaches for gene expression data. The first approach, common in popular frameworks for gene expression classification, involves two main steps: feature extraction or feature selection on the gene expression data, followed by learning classifiers. [26] applied the GA/KNN method to generate a subset of the features and then used the kNN algorithm for classification. In recent years, deep convolutional neural networks have achieved remarkable results in computer vision [27] and text classification [28]. In addition, DCNN is also used for omics, biomedical imaging, and biomedical signal processing [29]. The paper [30] proposed a deep learning algorithm based on the deep convolutional neural network for the classification of gene expression data. Lyu et al. use a DCNN [31] to predict over 11,000 tumors from the 33 most prevalent forms of cancer. These algorithms aim to reduce the dimensionality of the data. More recently, DCNN-SVM [13] was proposed to classify gene expression data. In addition, many other methods have been implemented for extracting only the important information from the gene expression data, thus reducing its size [32,33,34,35]. Feature extraction creates new variables as combinations of others to reduce the dimensionality of the selected features.

The second approach uses data enhancement methods in conjunction with SVM to efficiently classify small-sample-size data. The paper [36] proposes enhancing the gene expression classification of support vector machines with generative adversarial networks. The SynTReN algorithm generates gene expression data using a network topology method [37]. Moreover, there have been several applications of data enhancement in bioinformatics, such as [38, 39]. SMOTE is a data enhancement method that generates new data with equal probabilities [15]. However, its behaviour on high-dimensional data has not been thoroughly investigated [40]. The paper [41] shows that in the high-dimensional setting only the kNN algorithm based on the Euclidean distance seems to benefit substantially from SMOTE, provided that feature selection is performed before using SMOTE.

In our algorithm, we take advantage of both approaches, DCNN and SMOTE, to solve the two main issues of classifying gene expression data. First, we exploit the ability of the DCNN to extract new latent features from gene expression data. This measure was proposed in our previous paper [13] to address the very high dimensionality of gene expression data; in the present approach, however, we upgrade the DCNN architecture for gene expression classification. The new feature vector size is only approximately 10% of the original size. Second, we apply the SMOTE algorithm to generate new samples from the features extracted by the DCNN. In this way the advantages of the DCNN for classifying gene expression data are exploited when SMOTE generates new data, which also overcomes the limitations of SMOTE on gene expression data. In addition, in this paper we apply our model to other algorithms, including support vector machines [42], linear SVM [21], k nearest neighbors [22], random forests [23], and decision trees C4.5 [24, 25].

Fig. 1. The workflow of our method

3 Methods

In our study, we use multiple classification algorithms together with SMOTE and DCNN for the precise classification of gene expression data. Our learning approach is composed of three phases, illustrated in Fig. 1. First, the new DCNN is used to extract new features from the gene expression data. Second, we use the SMOTE algorithm to enhance the gene expression data using the features extracted by the DCNN. Finally, these algorithms are used in conjunction with the various classifiers to classify gene expression data efficiently.

3.1 Feature Extraction Gene Expression Data by DCNN

Fig. 2. A new DCNN architecture for feature extraction in processing gene expression data.

DCNN plays a dominant role in the community of deep learning models [43]. It is a multi-layer neural network architecture directly inspired by the visual cortex of the human brain [44]. In this network structure, the successive layers are designed to learn progressively higher-level features, up to the last layer, which produces categories. Once training is completed, the last layer acts as a linear classifier operating on the features extracted by the previous layers. Although DCNN is the most widely used method in the field of image processing, it is rarely used in gene expression classification.

In order to develop a powerful classifier which can implicitly extract sparse feature relations from an extremely large feature space, we propose an extraction model based on DCNN, one of the state-of-the-art learning techniques. The architecture of this model consists of two convolutional layers, two pooling layers, and a fully connected layer, as shown in Fig. 2. The layers are respectively named CONV1, POOLING1, CONV2, POOLING2, and output (numbers indicate the sequential position of the layers). The input layer receives the gene expression data in 2-D matrix format: each high-dimensional expression vector is embedded into a 2-D image by appending zeros at the last line of the image. The first layer, CONV1, contains 4 feature maps with kernel size (\(3\times 3\)). The second layer, POOLING1, applies average pooling with a (\(2\times 2\)) filter to the output of the first layer. CONV2 uses a (\(3\times 3\)) convolution kernel to output 2 feature maps, and POOLING2 is a (\(2\times 2\)) sub-sampling layer. We propose to use the Tanh activation function for the neurons. The final layer has a variable number of maps that combine inputs from all maps in POOLING2. The feature maps of the final sub-sampling layer are then fed into the actual classifier, consisting of an arbitrary number of fully connected layers. The output layer is used to extract the new features from the original gene expression data.
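For concreteness, the following is a minimal sketch of this extraction architecture using TensorFlow/Keras. The padding mode, image side length, and helper names (build_extractor, embed) are illustrative assumptions of ours; the exact hyper-parameters used in the experiments are those reported in Table 2.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_extractor(side, n_features):
    """side: width/height of the 2-D image embedding; n_features: size of
    the extracted feature vector (roughly 10% of the original input)."""
    return models.Sequential([
        layers.Input(shape=(side, side, 1)),
        layers.Conv2D(4, (3, 3), activation="tanh", padding="same"),   # CONV1: 4 maps
        layers.AveragePooling2D((2, 2)),                               # POOLING1
        layers.Conv2D(2, (3, 3), activation="tanh", padding="same"),   # CONV2: 2 maps
        layers.AveragePooling2D((2, 2)),                               # POOLING2
        layers.Flatten(),
        layers.Dense(n_features, activation="tanh", name="features"),  # extracted features
    ])

def embed(x, side):
    """Embed a gene expression vector into a side x side image,
    zero-padding the tail as described in the text."""
    img = np.zeros(side * side, dtype=np.float32)
    img[: x.size] = x
    return img.reshape(side, side, 1)
```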

3.2 Enhancing Gene Expression Data by SMOTE

The Synthetic Minority Over-sampling Technique (SMOTE), first introduced by [15], is an over-sampling approach. Its main idea is that the minority class is over-sampled by creating synthetic examples rather than by over-sampling with replacement. It begins by finding the k nearest neighbors of each minority sample and then generates synthetic examples by interpolation towards those neighbors, up to the desired percentage of new minority class observations. However, this algorithm is usually applied to low-dimensional data [16]; in fact, it seems beneficial but less effective for very-high-dimensional data. It is therefore often combined with data preprocessing methods, such as feature selection or feature extraction, that aim to reduce the dimensionality of the original data.

We propose a new SMOTE algorithm (Algorithm 1) that generates synthetic gene expression data from the new features extracted by the DCNN. Our algorithm generates synthetic data with characteristics very similar to those of the training data points. Synthetic data points (\(x_{new}\)) are generated in the following way. First, the algorithm takes a feature vector and one of its nearest neighbors and computes the difference between these vectors. Second, this difference is multiplied by a random number (\(\lambda \)) between 0 and 1 and added back to the feature vector. This selects a random point along the line segment between two specific feature vectors. Then, a linear support vector machine with constant \(C = 10^3\) is used to label the generated samples. The amount of new samples (\(p\%\)) and the number of nearest neighbors k are hyper-parameters of the algorithm.

Algorithm 1. Enhancing gene expression data by SMOTE on features extracted by the DCNN
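A minimal sketch of this enhancement step is given below, assuming the features have already been extracted by the DCNN. The interpolation follows classic SMOTE [15] and the synthetic labels are assigned by a linear SVM with \(C = 10^3\) as described above; the function name and random-sampling details are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def smote_enhance(X, y, p=100, k=5, rng=np.random.default_rng(0)):
    """X: (m, d) DCNN features; y: labels; p: oversampling percent; k: neighbors."""
    n_new = int(len(X) * p / 100)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        j = idx[i][rng.integers(1, k + 1)]            # random neighbor (skip self)
        lam = rng.random()                            # lambda in (0, 1)
        synth.append(X[i] + lam * (X[j] - X[i]))      # point on the segment X[i] -> X[j]
    X_new = np.asarray(synth)
    labeler = LinearSVC(C=1e3).fit(X, y)              # label the synthetic points
    y_new = labeler.predict(X_new)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```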

3.3 Gene Expression Classification of SVM with SMOTE Using Features Extracted from DCNN

The original SVM algorithm was invented by Vapnik [42]. The SVM algorithm is systematic and properly motivated by statistical learning theory. It is a supervised learning model that is widely applied to classification and regression [45].

The SVM algorithm finds the best separating hyperplane, the one furthest from the different classes. To achieve this, SVM maximizes the distance between the two boundary hyperplanes to reduce the probability of misclassification. The optimal hyperplane found by SVM is maximally distant from the two classes of labeled points located on each side (Fig. 3). In practice, the SVM algorithm gives good accuracy in classifying very-high-dimensional data. Although SVM is well known as an efficient model for classifying gene expression data, small-sample-size training datasets degrade the classification performance of any model [46]. In addition to performing linear classification, the algorithm has been very successful in building highly non-linear classifiers by means of kernel-based learning methods [47]. Kernel-based learning methods transform the input space into higher dimensions, using for instance a radial basis function (RBF), a sigmoid function, or a polynomial function. In the proposed approach, a non-linear SVM with an RBF kernel is used to classify the gene expression data after feature extraction and data enhancement.

Fig. 3. SVM for binary classification

The proposed algorithm is an effective combination of three algorithms: DCNN, SMOTE, and SVM. The algorithm performs the training task in three main phases (Fig. 1).

First of all, we implement a new DCNN that extracts new features from the original gene expression data. Our model takes advantage of the DCNN's ability to learn latent features from very-high-dimensional input spaces. This process can be viewed as a projection of the data from a higher-dimensional space to a lower-dimensional one. Moreover, these new features improve the dissimilarity power of the gene expression representations and thus yield a higher accuracy rate than the original features.

Although the data dimensionality has been reduced, the training sample size is still small relative to the feature vector size, so classifiers may give poor performance due to over-fitting. In the second training phase, our model therefore uses SMOTE to generate new samples from the features extracted by the DCNN model. In the very-high-dimensional setting, only kNN classifiers based on the Euclidean distance seem to benefit substantially from over-sampling, provided that feature extraction by the extraction model is performed first; the traditional over-sampling algorithm is not effective for very-high-dimensional data, and this problem is tackled by the DCNN model in our approach.

Last but not least, our model generates the new training data on which the classifiers learn to classify gene expression data efficiently. The classifiers consist of non-linear SVM, linear SVM, kNN, RF, and C4.5, which are used to classify the new data. In our approach, we propose to use the RBF kernel in the SVM model because it is general and efficient [45]. Moreover, the combination of DCNN and SMOTE can also improve the classification accuracy of linear SVM, RF, and kNN.
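Putting the three phases together, a hedged end-to-end sketch (reusing the build_extractor, embed, and smote_enhance helpers sketched earlier) could look as follows. The image side length, the temporary softmax training head, and the default hyper-parameter values are our assumptions; the tuned values used in the experiments are described in Sect. 4.1.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

def train_pipeline(X_raw, y, side=150, p=200, k=5):
    """X_raw: (m, n) gene expression matrix; y: integer class labels."""
    n_features = max(2, X_raw.shape[1] // 10)         # ~10% of the original size
    imgs = np.stack([embed(x, side) for x in X_raw])  # 2-D embedding (Sect. 3.1)

    # Phase 1: train the DCNN with a temporary softmax head (cross-entropy
    # loss, Adam, lr 2e-5), then keep the feature layer as the extractor.
    extractor = build_extractor(side, n_features)
    model = tf.keras.Sequential(
        [extractor, tf.keras.layers.Dense(len(np.unique(y)), activation="softmax")])
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy")
    model.fit(imgs, y, batch_size=16, epochs=200, verbose=0)
    feats = extractor.predict(imgs, verbose=0)

    # Phase 2: enhance the extracted features with SMOTE (Sect. 3.2).
    X_aug, y_aug = smote_enhance(feats, y, p=p, k=k)

    # Phase 3: train the final non-linear SVM (RBF kernel) on the enhanced data.
    clf = SVC(kernel="rbf", C=1e3, gamma=1e-3).fit(X_aug, y_aug)
    return extractor, clf
```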

4 Evaluation

We are interested in the classification performance of our proposal on gene expression data. Therefore, we report a comparison of the classification performance obtained by our model and the best state-of-the-art algorithms, including support vector machines (SVM) [42], linear SVM (LSVM) [21], decision trees (C4.5) [25], k nearest neighbors (kNN) [22], and random forests (RF) [23].

In addition, we compare the various versions of DCNN-SMOTE (DCNN-SMOTE \(\rightarrow \) [SVM, LSVM, RF, kNN, C4.5]) with SVM, LSVM, RF, kNN, and C4.5. These results are used to evaluate the performance of the classifiers after applying DCNN-SMOTE. Moreover, since we are interested in the effectiveness of the enhancement model, we also evaluate classification using only the features extracted by the extraction model (DCNN \(\rightarrow \) [SVM, LSVM, RF, C4.5, kNN]) and compare it to our proposal.

In order to evaluate the effectiveness in classification tasks, we have implemented DCNN-SMOTE-SVM and its variants in Python using the Scikit-learn [48] and TensorFlow [49] libraries. Other algorithms, such as RF and C4.5, use the Scikit-learn library. We use the highly efficient standard SVM implementation LibSVM [21] with the one-versus-one strategy for multi-class problems. We use the paired Student's t-test to assess the classification results of the learning algorithms.
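For illustration, the paired test between the per-dataset accuracies of two models can be computed with SciPy as below (a sketch; acc_a and acc_b stand in for the hypothetical length-50 accuracy arrays of two models):

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset accuracies of two models on the 50 datasets.
rng = np.random.default_rng(0)
acc_a = rng.uniform(0.7, 1.0, 50)
acc_b = acc_a - rng.uniform(0.0, 0.05, 50)

# Paired Student's t-test: is the mean difference significantly non-zero?
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```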

All tests were run under Linux Mint on a 3.07 GHz Intel(R) Xeon(R) CPU PC with 8 GB RAM.

Table 1. Description of microarray gene expression datasets

4.1 Experiments Setup

Experiments are conducted with fifty very-high-dimensional datasets from the Kent Ridge Biomedical [18] and Array Express repositories [19]. The characteristics of the datasets are summarized in Table 1.

The evaluation protocols are listed in the last column of Table 1. For datasets with available training (trn) and testing (tst) sets, we use the training data to tune the parameters of the algorithms to obtain good accuracy in the learning phase; the resulting model is then evaluated on the test set. For datasets with fewer than 300 data points, the test protocol is leave-one-out cross-validation (loo). For the others, we use the 10-fold cross-validation protocol, which remains the most widely used for performance evaluation [50]. The overall classification accuracy measure is used to evaluate the classification models.
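A sketch of this protocol selection with scikit-learn follows; the 300-sample threshold comes from the text, while the helper name and the stratification choice are our assumptions:

```python
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

def protocol(n_samples):
    """Leave-one-out for datasets with fewer than 300 samples, 10-fold CV otherwise."""
    return LeaveOneOut() if n_samples < 300 else StratifiedKFold(n_splits=10)

# Usage sketch: acc = cross_val_score(clf, X, y, cv=protocol(len(X))).mean()
```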

Table 2. Hyper-parameters of DCNN-SMOTE-SVM

For model training, we tune the parameters of three components: the DCNN, SMOTE, and the classifiers.

To train the network, we use Adam for optimization [51] with a batch size of 8 to 32. We start training with a learning rate of 0.00002 for all layers and then raise it manually each time the validation error rate stops improving. Cross entropy is used as the loss function of the DCNN. The number of epochs is 200.

In our algorithm, the number of neighbors (k) is chosen from \(\{1, 3, 5, 7, 9\}\). The samples are over-sampled (p) at 100%, 200%, and 300% of their original sample size. We tune the hyper-parameter \(\gamma \) of the RBF kernel and the cost C (a trade-off between the margin size and the errors) to obtain the best accuracy. The cost C is chosen from \(\{1, 10, 10^2, 10^3, 10^4, 10^5\}\), and the hyper-parameter \(\gamma \) of the RBF kernel is tried among \(\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}\). All the optimal parameters are shown in Table 2.
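This tuning can be expressed with scikit-learn's GridSearchCV, as in the hedged sketch below; the grid values are those listed above, while the inner cross-validation split is our assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [1, 10, 1e2, 1e3, 1e4, 1e5],
    "gamma": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
}
# X_aug, y_aug: SMOTE-enhanced features from the previous phase.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X_aug, y_aug); the tuned values appear in search.best_params_
```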

For the other algorithms, the cost constant C of linear SVM is set to \(10^3\). For non-linear SVM, we adjust the hyper-parameter \(\gamma \) and the cost C to get the best result. RF learns 200 decision trees to classify all datasets. kNN tries k among \(\{1, 3, 5, 7\}\).

Table 3. Classification results of 15 models on 50 datasets (%)

4.2 Classification Results

Table 3 gives the results of the classification algorithms on 50 gene expression datasets. The best results are in bold face and the second best in italic. The plot charts in Figs. 4, 5 and 6 also visualise the classification results. Table 4 summarizes the results of the paired Student's t-tests and presents the mean accuracy of these models.

First and foremost, we evaluate the feature extraction and data enhancement algorithms. We compare the accuracy of the classification algorithms (SVM, LSVM, kNN, and RF) with the corresponding versions of DCNN-SMOTE (DCNN-SMOTE-SVM, DCNN-SMOTE-LSVM, DCNN-SMOTE-kNN, and DCNN-SMOTE-RF).

At first sight, Tables 3, 4 and Fig. 6 show that DCNN-SMOTE-SVM, DCNN-SMOTE-LSVM, DCNN-SMOTE-kNN, and DCNN-SMOTE-RF significantly increase the mean accuracy by 4.83, 3.37, 2.9, and 2.08 percentage points compared to SVM, LSVM, kNN, and RF respectively. All p-values are less than 0.05. In detail, DCNN-SMOTE-SVM performs well compared to SVM, with 29 wins, 11 ties, 10 defeats, p-value = 1.33E-03, and DCNN-SMOTE-LSVM has 34 wins, 7 ties, 9 defeats (p-value = 8.72E-03) compared to LSVM. In the comparison to kNN, DCNN-SMOTE-kNN wins on 29 out of 50 datasets (29 wins, 4 ties, 17 defeats, p-value = 2.26E-03). Besides, DCNN-SMOTE-RF has 29 wins, 12 ties, 9 defeats (p-value = 2.78E-02) compared to RF. These results show the effectiveness of DCNN and SMOTE in improving the accuracy of the SVM, LSVM, RF, and kNN classifiers. In the comparison between DCNN-SMOTE-C4.5 and C4.5, DCNN-SMOTE-C4.5 is slightly superior to the C4.5 decision tree, with 27 wins, 2 ties, 21 defeats, p-value = 1.06E-01 (not significantly different).

In addition, it is clear that DCNN-SMOTE-SVM shows the best performance. Tables 3 and 4 show that it significantly improves the mean accuracy by 4.82, 3.53, 9.40, 5.55, and 12.48 percentage points compared to SVM, LSVM, kNN, RF, and C4.5 respectively. All p-values are less than 0.05. In detail, it has 29 wins, 10 ties, 11 defeats (p-value = 1.33E-03) against SVM and 33 wins, 11 ties, 6 defeats (p-value = 4.68E-04) compared to LSVM. This model also has 48 wins, 1 tie, 1 defeat (p-value = 5.57E-09) compared to kNN and 41 wins, 5 ties, 4 defeats (p-value = 6.01E-09) compared to RF. In the comparison to C4.5, our model wins on 46 out of 50 datasets (46 wins, 1 tie, 3 defeats, p-value = 3.12E-12).

Moreover, the DCNN-SMOTE-SVM model classifies more efficiently than the other variants DCNN-SMOTE\(\rightarrow \)[LSVM, kNN, RF, C4.5]. In detail, this model improves the mean accuracy by 0.63, 5.67, 3.47, and 9.7 percentage points compared to DCNN-SMOTE\(\rightarrow \)[LSVM, kNN, RF, C4.5] respectively.

Furthermore, combining DCNN and SMOTE enhances the accuracy of the classifiers compared to classification using only the features extracted by the DCNN. It is clear that DCNN-SMOTE\(\rightarrow \)[SVM, LSVM, kNN, RF, C4.5] increases the mean accuracy by 0.98, 1.09, 3.15, 1.06, and 1.00 percentage points compared to DCNN \(\rightarrow \)[SVM, LSVM, kNN, RF, C4.5]. These results show that using DCNN together with SMOTE is more effective than our previous approach [13].

The running time of our model consists of three parts: the time to train the deep convolutional network for extracting the features, the time to generate new samples, and the training time of the classifier on the new data. The average time of the first part on the 50 datasets is 48.26 s, and the average time of the second part is 29.37 s. Finally, the average times of the third part for SVM, kNN, LSVM, RF, and C4.5 in our model are, respectively, 0.91, 0.08, 1.83, 2.75, and 1.37 s, while the running times of SVM, kNN, LSVM, RF, and C4.5 on the original data are 8.48, 0.31, 54.85, 10.76, and 6.3 s.

These experiments allow us to believe that our approach efficiently handles gene expression data with small sample sizes and very-high-dimensional inputs. Moreover, the combination of DCNN and SMOTE improves the performance not only of non-linear SVM but also of linear SVM, kNN, and random forests.

Table 4. Summary of the accuracy comparison
Fig. 4. Comparison of the accuracy of DCNN-SMOTE-SVM vs. SVM and of DCNN-SMOTE-LSVM vs. LSVM on 50 datasets (%).

Fig. 5. Comparison of the accuracy of DCNN-SMOTE-RF vs. RF and of DCNN-SMOTE-kNN vs. kNN on 50 datasets (%).

Fig. 6. Comparison of the mean accuracy of the classification models

5 Conclusion and Future Works

We have presented a new classification algorithm combining multiple classifiers with SMOTE, using features extracted by a DCNN, that tackles the very-high-dimensionality and small-sample-size issues of gene expression data classification. A new DCNN model extracts new features from the original gene expression data, and then a SMOTE algorithm generates new data from these features. These models are used in conjunction with classifiers that efficiently classify gene expression data. From the obtained results, it is observed that DCNN-SMOTE can improve the performance of the SVM, linear SVM, random forests, and k nearest neighbors algorithms. In addition, the proposed DCNN-SMOTE-SVM approach is the most accurate when compared to the state-of-the-art classification models under consideration.

In the near future, we intend to provide more empirical tests on large benchmarks and comparisons with other algorithms. A promising line of future research is automatically tuning the hyper-parameters of our algorithms.