1 Introduction

The development of high-throughput technologies such as DNA microarrays has led to steady growth in public databases such as ArrayExpress [1] and the NCBI Gene Expression Omnibus [2]. Microarray technology enables researchers to investigate issues that were once thought to be intractable by facilitating the simultaneous measurement of the expression levels of thousands of genes in a single experiment [3]. A characteristic of microarray gene expression data is that the number of variables (genes) m far exceeds the number of samples n, commonly known as the curse of dimensionality. The vast amount of gene expression data leads to statistical and analytical challenges, and conventional statistical methods give improper results because of the high dimensionality of microarray data combined with a limited number of patterns [4]. Building a machine learning model directly on such data is often infeasible because of the extremely large feature set, with millions of features, and the high computing cost.

With the wealth of gene expression data from microarrays being produced, more and more new prediction, classification, and clustering techniques are being used for the analysis of the data. Many methods have been used for microarray gene expression data classification, and typical methods are support vector machines (SVM) [5,6,7,8], k-nearest neighbor classifier [9], C4.5 decision tree [10,11,12] and ensemble methods, such as random forests [13], random forests of oblique decision trees [14], bagging and boosting [15, 16].

In recent years, convolutional neural networks (CNNs) have achieved remarkable results in computer vision [17] and text classification [18]. CNNs are also used for omics, biomedical imaging and biomedical signal processing [19]. Most data in bioinformatics are raw data such as gene sequences, proteins, microarrays and medical images. Conventional machine learning algorithms have limitations in processing data in raw form, so hybrid models are often used to combine the feature extraction of CNNs on raw data with the classification performance of SVM or random forests (RF). A hybrid model of a neural network and an SVM was initially proposed in [20]. In [21], a hybrid model was later proposed for handwritten digit recognition. More relevant previous work includes [22], where a hybrid approach is presented: the CNN is trained using the back-propagation algorithm and the SVM is trained using a non-linear regression approach. Notably, the hybrid model achieved a better classification error rate. In [23], a hybrid model is used for recognition in mobile swarm robotic systems. In addition, CNNs and RF have been combined to build a hybrid model for the segmentation of electron microscopy images [24].

In this paper, we propose a hybrid model combining deep convolutional neural networks (DCNNs) and SVM (called DCNN-SVM) to effectively classify very-high-dimensional gene expression data. The main idea of our approach is to train a specialized DCNN to extract robust hierarchical features from microarray gene expression data (MGE data) and to provide them to an SVM classifier using a radial basis function (RBF) kernel. Our approach differs from the previous ones in that we build a single model instead of using disjoint classifiers trained separately. In the relevant previous work, the CNN is trained using the back-propagation algorithm, and the classifier is an SVM trained using a non-linear regression approach, an SVM with a linear kernel function, or a random forest. The data in that work were also images, such as handwritten digits, medical images and video.

We used 15 datasets from ArrayExpress [1] and the Biomedical repository [25] to evaluate our model and to compare it with traditional classification methods such as DCNNs, support vector machines [26] and random forests [27]. The results show that DCNN-SVM extracts robust hierarchical features and improves classification accuracy. In general, our method performs very well when the support vector machine classifier uses the radial basis function kernel.

The paper is organized as follows. Section 2 presents our approach, a hybrid model combining DCNNs and SVM. Section 3 shows the experimental results. We then conclude in Sect. 4.

2 Methods

2.1 Deep Convolutional Neural Networks

DCNNs are designed to process multiple data types, especially two-dimensional images, and are directly inspired by the visual cortex of the brain. In the visual cortex, there is a hierarchy of two basic cell types: simple cells and complex cells [28]. Simple cells react to primitive patterns in sub-regions of visual stimuli, and complex cells synthesize the information from simple cells to identify more intricate forms. Since the visual cortex is such a powerful and natural visual processing system, DCNNs are designed to imitate three key ideas: local connectivity, invariance to location, and invariance to local transition [29]. Three main types of layers are used to build DCNN architectures: convolutional layers, pooling layers, and fully connected layers. Normally, a full DCNN architecture is obtained by stacking several of these layers. In a DCNN, the key computation is the convolution of a feature detector with an input signal. A convolutional layer computes the output of neurons connected to local regions of the input, each neuron computing a dot product between its weights and the region it is connected to in the input volume. The set of weights that is convolved with the input is called a filter or kernel. Every filter is small spatially (width and height) but extends through the full depth of the input volume. For inputs such as images, typical filters cover small areas, and each neuron is connected only to this area in the previous layer. The weights are shared across neurons, leading the filters to learn frequent patterns that occur in any part of the image. The distance between successive applications of a filter is called the stride. When the stride hyperparameter is smaller than the filter size, the convolution is applied in overlapping windows.
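
The role of the filter and the stride can be illustrated with a minimal NumPy sketch. It assumes a single-channel input, no padding, and the sliding dot product used by most deep learning libraries (strictly a cross-correlation); the function name `conv2d_valid` and the toy input are illustrative and not part of our implementation.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide one filter over a 2-D input (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Dot product between the shared weights and one local region.
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 input (assumption)
kernel = np.ones((3, 3)) / 9.0                    # toy 3x3 averaging filter

overlapping = conv2d_valid(image, kernel, stride=1)  # stride < filter size
disjoint = conv2d_valid(image, kernel, stride=3)     # stride == filter size
print(overlapping.shape, disjoint.shape)
```

With a stride of 1 the 3x3 windows overlap and the 6x6 input yields a 4x4 feature map; with a stride of 3 the windows are disjoint and the map shrinks to 2x2.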

2.2 Support Vector Machines

Support vector machines (SVMs), proposed by Vapnik [26], are systematic and well motivated by statistical learning theory. SVMs are the best-known class of learning algorithms that use the idea of kernel substitution. SVMs and kernel-based methods have shown practical relevance for classification and regression [30]. The SVM algorithm finds the separating plane that is furthest from the different classes. To achieve this, it simultaneously maximizes the margin (the distance between the supporting planes of the two classes) and minimizes the error (any point falling on the wrong side of its supporting plane is considered an error). For a binary classification problem (see Fig. 1), samples of one class lie on one side of the hyperplane while samples of the other class lie on the other side.

Fig. 1 Linear separation of the datapoints into two classes
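
The trade-off between margin and error can be made concrete with a short sketch. It assumes the standard soft-margin formulation, in which the objective is \(\frac{1}{2}\Vert w\Vert ^2 + C\sum _i \xi _i\) with slack \(\xi _i = \max (0, 1 - y_i(w\cdot x_i + b))\); the function name and the four toy points are illustrative only.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (margin width, total slack, objective) for the hyperplane w.x + b = 0."""
    margin = 2.0 / np.linalg.norm(w)  # distance between the two supporting planes
    # Slack is positive only for points on the wrong side of their supporting plane.
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return margin, slack.sum(), objective

# Toy data (assumption): the last point is labelled -1 but lies on the positive side.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

margin, total_slack, objective = soft_margin_objective(
    np.array([1.0, 0.0]), 0.0, X, y, C=1.0
)
print(margin, total_slack, objective)
```

A larger C penalizes the slack more heavily, so the optimizer accepts a narrower margin in exchange for fewer errors.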

For multiclass problems, one-versus-all [26] and one-versus-one [31] are the most popular methods because of their simplicity. Consider k classes (\(k>2\)). The one-versus-all strategy builds k different classifiers, where the ith classifier separates the ith class from the rest. The one-versus-one strategy constructs \(k(k -1)/2\) classifiers, using all the binary pairwise combinations of the k classes. The class is then predicted by majority vote.
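
As a small illustration of the two decompositions, the sketch below enumerates the binary tasks for k = 4 and applies a majority vote to a set of pairwise winners. The class names and the winners are hypothetical; no classifier is trained here.

```python
from collections import Counter
from itertools import combinations

def one_versus_all_tasks(classes):
    """One binary task per class: that class against the rest."""
    return [(c, "rest") for c in classes]

def one_versus_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

def majority_vote(pairwise_winners):
    """Return the class that wins the most pairwise duels."""
    return Counter(pairwise_winners).most_common(1)[0][0]

classes = ["A", "B", "C", "D"]            # k = 4 (hypothetical labels)
ova = one_versus_all_tasks(classes)       # k tasks
ovo = one_versus_one_tasks(classes)       # k(k-1)/2 tasks

# Hypothetical winners of the six duels, in the order given by `ovo`.
winners = ["A", "C", "A", "C", "B", "C"]
predicted = majority_vote(winners)
print(len(ova), len(ovo), predicted)
```

One-versus-one trains more classifiers, but each of them sees only the samples of two classes.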

SVMs can also use other kernel functions for classification, for example a polynomial function of degree d, a radial basis function (RBF) or a sigmoid function. More details about SVMs and other kernel-based learning methods can be found in [32].
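
For reference, the three kernels can be written in a few lines of NumPy. The sketch assumes the parametrization commonly used by SVM libraries such as LibSVM, with a scale \(\gamma \), an offset `coef0` and a degree d; the default values below are placeholders, not the settings used in our experiments.

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    """(gamma * <x, z> + coef0) ** degree"""
    return (gamma * np.dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.01):
    """exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=0.01, coef0=0.0):
    """tanh(gamma * <x, z> + coef0)"""
    return np.tanh(gamma * np.dot(x, z) + coef0)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(polynomial_kernel(x, z, degree=2, coef0=1.0))
print(rbf_kernel(x, z, gamma=0.5))
print(sigmoid_kernel(x, z))
```

The RBF kernel depends only on the distance between the two samples, so it equals 1 for identical samples and decays towards 0 as they move apart, at a rate set by \(\gamma \).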

2.3 Support Vector Machines Using the Feature Extraction from Deep Convolutional Neural Networks

DCNNs are efficient at learning invariant features from data, but do not always produce optimal classification results. Conversely, a non-linear SVM cannot learn complex invariances, but produces good decision surfaces by maximizing margins using soft-margin approaches [33].

We propose a hybrid model architecture that couples an SVM with the feature learning of DCNNs (denoted by DCNN-SVM) for classifying microarray gene expression data. The training task of DCNN-SVM consists of two main steps. First, the algorithm trains a DCNN to extract functional features from high-dimensional gene expression profiles. Next, it trains non-linear SVM models to classify the data representation extracted in the first step.

The network architecture is shown in Fig. 2. The first layer takes the gene expression data as input. The second and fourth layers of the network are convolutional layers, which alternate with sub-sampling layers; the later convolutional layer takes the pooled maps as input. Consequently, the layers are able to extract features that are more and more invariant to local transformations of the input. The sixth layer is a fully connected layer. The final layer is replaced by an SVM with the RBF kernel for classification. The outputs of the hidden units are taken by the SVM as a feature vector for its training process. Training then continues until the model is well trained. Finally, classification on the test set is performed by the SVM classifier with these automatically extracted features.

Fig. 2 The DCNN-SVM architecture
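
The hand-off from the network to the SVM can be sketched as follows, using scikit-learn. This is not our implementation: a fixed random projection followed by tanh stands in for the hidden-unit outputs of a trained DCNN, the data are synthetic, and the values of C and \(\gamma \) are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def extract_features(X, weights):
    """Stand-in for the hidden-unit outputs of a trained DCNN (assumption)."""
    return np.tanh(X @ weights)

# Synthetic data (assumption): 60 samples, 400 "genes", two classes shifted apart.
X = rng.normal(size=(60, 400))
y = np.repeat([0, 1], 30)
X[y == 1, :20] += 1.0

weights = rng.normal(scale=0.05, size=(400, 32))  # hypothetical fixed projection
train, test = np.arange(0, 60, 2), np.arange(1, 60, 2)

# The SVM with the RBF kernel replaces the final classification layer.
svm = SVC(kernel="rbf", C=10.0, gamma="scale")
svm.fit(extract_features(X[train], weights), y[train])
accuracy = svm.score(extract_features(X[test], weights), y[test])
print(f"test accuracy: {accuracy:.2f}")
```

The same feature extractor is applied to the training and the test samples, so the SVM always sees vectors from the same learned representation.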

3 Evaluation

We implemented DCNN-SVM, SVM and random forests in Python, using the SVM library LibSVM [34], TensorFlow [35] and the scikit-learn library [36]. All tests were run under Linux Mint on a single PC with a 2.4 GHz Core i3 processor and 8 GB of RAM.

Table 1 Description of microarray gene expression datasets

3.1 Experiments Setup

In our experiments, we use datasets provided by the ArrayExpress database [1] and the Medical Database (Kent Ridge) [25]. The ArrayExpress archive of functional genomics data stores data from high-throughput functional genomics experiments. We downloaded MGE datasets from ArrayExpress. The criteria for selecting the datasets were that the experiments had been conducted in humans and in the field of cancer, and that the datasets had been published or updated after 2012 and provided processed data. To reduce the variability in classification performance caused by the array used in the experiments, we retained only studies conducted with Affymetrix arrays. The datasets and their characteristics are summarized in Table 1.

The test protocols are presented in column 5 of Table 1. Some datasets are already divided into a training set (Trn) and a testing set (Tst). For these datasets, we used the training data to build our model and then classified the testing set using the resulting model. For datasets with fewer than 300 data points, the test protocol is leave-one-out cross-validation (loo). For the others, we used 10-fold cross-validation, which remains the most widely used protocol for evaluating performance [42]. Our evaluation is based on classification accuracy.
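
The choice of protocol can be summarized in a few lines of plain Python. The function names are illustrative, and the fold construction is a simplified assumption (interleaved folds without shuffling or stratification); it is only meant to show that leave-one-out is the special case where the number of folds equals the number of samples.

```python
def choose_protocol(n_samples, has_predefined_split):
    """Pick the test protocol following the rules described above."""
    if has_predefined_split:
        return "train/test"
    if n_samples < 300:
        return "loo"
    return "10-fold"

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out

print(choose_protocol(120, False))   # fewer than 300 samples
print(choose_protocol(1500, False))  # larger dataset
print(choose_protocol(120, True))    # Trn/Tst already provided

ten_fold = list(k_fold_indices(25, 10))
leave_one_out = list(k_fold_indices(5, 5))  # k == n_samples
```

In every split each sample is held out exactly once, and the reported accuracy is the fraction of held-out samples that are classified correctly.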

The DCNN-SVM architecture is shown in Table 2. It consists of two convolutional layers with 32 and 16 feature maps and \((3\times 3)\) kernels, each followed by a \((2\times 2)\) average pooling layer. The features are taken from the last fully connected layer, and the SVM takes these outputs for classification. The one-versus-all method is used for the multi-class SVM, while the convolutional part can be viewed as a trainable feature extractor. We also tried other CNN configurations, but this one gave the best performance. Input data are transformed as follows: each patient sample is represented by its microarray expression features, which are transformed into a feature matrix. For the deep convolutional neural network, we use the ADAM method [43] for optimization and cross-entropy as the loss function. The batch size is set to 16 and 50 epochs are used. We also tuned the activation function, trying ReLU, Tanh and Sigmoid. The Tanh activation works better than the other activation functions for microarray gene expression data.
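
To make the sizes concrete, the sketch below turns one expression profile into a square matrix and tracks the feature-map shapes through the two convolution and pooling blocks. Several details are assumptions made for illustration only: zero-padding the profile to the next square size, convolutions without padding and with stride 1, pooling with stride 2, and a hypothetical profile of 2000 genes.

```python
import math
import numpy as np

def to_feature_matrix(expression):
    """Zero-pad a 1-D expression profile and reshape it to a square matrix (assumed layout)."""
    side = math.isqrt(len(expression) - 1) + 1  # smallest side with side * side >= length
    padded = np.zeros(side * side, dtype=float)
    padded[:len(expression)] = expression
    return padded.reshape(side, side)

def layer_shapes(side, conv_maps=(32, 16), kernel=3, pool=2):
    """Track (height, width, maps) through conv -> average-pool blocks (no padding, assumed)."""
    shapes = [(side, side, 1)]
    for maps in conv_maps:
        side = side - kernel + 1   # 3x3 convolution, stride 1
        shapes.append((side, side, maps))
        side = side // pool        # 2x2 average pooling, stride 2
        shapes.append((side, side, maps))
    return shapes

profile = np.random.default_rng(0).normal(size=2000)  # hypothetical 2000-gene sample
matrix = to_feature_matrix(profile)
shapes = layer_shapes(matrix.shape[0])
for shape in shapes:
    print(shape)
```

Under these assumptions a 2000-gene profile becomes a 45x45 matrix, and the second pooling layer outputs 16 maps of size 9x9 before the fully connected layer.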

We use the RBF kernel in the SVM models because it is general and efficient [44]. We also tuned the parameter \(\gamma \) of the RBF kernel and the cost C (a trade-off between the margin size and the errors) to obtain good accuracy. These parameters are presented in Table 2.

Table 2 Hyper-parameters of SVM used in DCNN-SVM
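
One common way to tune \(\gamma \) and C is an exhaustive grid search with cross-validation, sketched here with scikit-learn. The grid values, the synthetic feature vectors and the 5-fold setting are assumptions for illustration; they are not the values reported in Table 2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for DCNN feature vectors (assumption): 80 samples, 32 features.
features = rng.normal(size=(80, 32))
labels = np.repeat([0, 1], 40)
features[labels == 1, :4] += 1.5

# Hypothetical search grid over the cost C and the RBF width gamma.
param_grid = {"C": [1, 10, 100, 1000], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(features, labels)

best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
print(best_C, best_gamma, round(search.best_score_, 3))
```

The selected pair is the one with the highest mean cross-validated accuracy, which is then used to train the final SVM on all training features.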

To evaluate the effectiveness of our approach, we ran two experiments to classify microarray samples. First, we compare DCNN-SVM with SVM, random forests (RF) and traditional DCNNs. In this experiment, the RF algorithm builds 200 decision trees and the SVM models use a linear kernel (\(C=10^{5}\), \(\gamma =0.01\)). Second, we compare different kernel functions in the SVM classifier: a linear kernel (DCNN-SVM linear) and a radial basis function kernel (DCNN-SVM) with the best parameters in Table 2. In addition, we compare DCNN-SVM with DCNNs using a random forest classifier (DCNN-RF).

Table 3 Classification results in terms of accuracy (%)
Fig. 3 Comparison of accuracy (%)

3.2 Experiments Results

Numerical test results on the 15 microarray datasets are shown in Table 3. They show that DCNN-SVM is more accurate than the classical DCNNs algorithm, SVM and random forests: DCNN-SVM has the best accuracy on 11 of the 15 datasets, whereas SVM and RF have the best accuracy on only 1 of the 15 datasets. Table 3 and Fig. 3 also show that, in the second experiment, DCNN-SVM with the RBF kernel achieves the best accuracy on 11 of the 15 datasets, DCNN-SVM with the linear kernel on 6, DCNN-RF on 5 and DCNNs on 4. The superiority of DCNN-SVM (RBF) over DCNNs, DCNN-RF and DCNN-SVM (linear) is also visible in the pairwise comparison on the 15 datasets: DCNN-SVM (RBF) has 5 wins over DCNN-SVM (linear) and 10 wins over DCNN-RF and over DCNNs.
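
The per-method counts above sum to more than 15, which is what one would expect if a tie credits every method that reaches the top accuracy on a dataset. The sketch below shows this way of counting; the method names and accuracies are toy values chosen for illustration and are not taken from Table 3.

```python
def count_best(accuracies):
    """Count, per method, the datasets where it reaches the top accuracy (ties count for all)."""
    wins = {method: 0 for method in accuracies}
    n_datasets = len(next(iter(accuracies.values())))
    for d in range(n_datasets):
        top = max(scores[d] for scores in accuracies.values())
        for method, scores in accuracies.items():
            if scores[d] == top:
                wins[method] += 1
    return wins

# Toy accuracies (%) on three hypothetical datasets, NOT the values of Table 3.
toy = {
    "method_a": [90.0, 85.0, 100.0],
    "method_b": [90.0, 80.0, 95.0],
}
wins = count_best(toy)
print(wins)
```

Here both methods tie on the first dataset, so the counts add up to four although there are only three datasets.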

4 Conclusion and Future Works

We have presented a hybrid model combining DCNNs and SVM to classify very-high-dimensional microarray gene expression data. The features are learned through a convolution process and then passed as input to an SVM classifier using the RBF kernel, which performs the classification task of interest. After tuning of the hyperparameters, the model performs comparatively well on the 15 datasets from ArrayExpress and the Medical Database. The numerical test results show that our proposal is more accurate than the classical DCNNs algorithm, support vector machines and random forests.

In the near future, we intend to provide more empirical tests on large microarray gene expression datasets and comparisons with other algorithms. Our proposal can also be parallelized effectively: a parallel implementation that exploits multicore processors can greatly speed up the learning and prediction tasks.