1 Introduction

Traditional neural networks generally consist of three layers: the first holds the data entries, the second is the hidden layer, and the third corresponds to the output layer. When the architecture has more than three layers, the network is commonly referred to as a deep neural network. The most representative example of this architecture is the multi-layer perceptron with many hidden layers, where each layer learns a different set of features based on the output of the previous layer [1, 2].

Deep learning algorithms have usually been applied to problems of high complexity due to the amount of data involved, that is, problems with a large number of both features and samples. They have been used extensively in various scientific areas to tackle very different problems [3, 4]. The main advantages of this type of neural network are three-fold: high performance, robustness to overfitting, and high processing capability.

In this work, we analyze the performance of several deep neural networks and other machine learning models in the classification of gene-expression microarrays, which are characterized by a very large number of features coupled with a small number of samples. This represents a challenging situation because typical applications of deep neural networks involve problems in which both the dimensionality and the number of samples are high. Therefore, the purpose of this paper is to investigate the efficiency of deep learning algorithms when applied to data sets with these special characteristics, thus checking whether or not they perform as well as in those applications where they have been demonstrated to behave significantly better than state-of-the-art algorithms.

2 Related Works

Nowadays, the use of deep learning to solve a variety of real-life problems has attracted the interest of many researchers because these algorithms generally obtain better results than traditional machine learning methods [5]. As already mentioned, deep neural networks consist of a very large number of hidden layers, which leads to a high computational cost when processing data of large size and high dimensionality.

The areas in which deep neural networks have been most widely applied are image recognition and natural language processing. For instance, Cho et al. [6] employed a recurrent neural network (RNN) encoder-decoder to detect semantic and syntactic representations of language when translating from English into French, thus obtaining a better translation of the analyzed sentences. The analysis of information to recognize translations, dialogues, text summaries and text produced in social networks was studied using techniques such as the convolutional neural network (CNN) and the RNN [7]. Nene [8] reviewed the developments and applications of deep neural networks in natural language processing.

In image processing, the use of deep neural networks makes tasks faster and yields better results. Dong et al. [9] proposed a CNN approach to learn an end-to-end mapping between low- and high-resolution images, performing better than the state-of-the-art methods. On the other hand, Wen et al. [10] combined a new loss function with the softmax loss to jointly supervise the learning of a CNN for robust face recognition. Gatys et al. [11] showed how the generic feature representations learned by high-performing CNNs can be used to independently process and manipulate the content and the style of natural images. A deep neural network based on bag-of-words for image retrieval tasks was proposed by Bai et al. [12]. A novel maximum margin multimodal deep neural network was introduced to take advantage of the multiple local descriptors of an image [13].

Apart from image and natural language processing, deep neural networks have also been applied to some other practical domains. For instance, Langkvist et al. [14] reviewed the use of deep learning for time-series modeling and prediction. Hinton et al. [15] presented an overview of the application of deep neural networks to acoustic modeling in speech recognition. Noda et al. [16] utilized a deep denoising autoencoder for acquiring noise-robust audio features and a CNN to extract visual features from raw mouth area images. Wang and Shang [17] employed deep belief networks to extract features from raw physiological data. Kraus and Feuerriegel [18] studied the use of deep neural networks for predicting stock market movements subsequent to the disclosure of financial materials. Heaton et al. [19] introduced an autoencoder-based hierarchical decision model for problems in financial prediction and classification.

The biomedical domain is another scientific area where the use of deep learning has been gaining much attention in recent years. For instance, Maqlin et al. [20] proposed the application of the deep belief neural network to determine the nuclear pleomorphism score of breast cancer tissues. Danaee [21] used a stacked denoising autoencoder for the identification of genes critical for the diagnosis of breast cancer. Abdel-Zaher and Eldeib [22] presented an automatic diagnosis system for detecting breast cancer based on an unsupervised pre-training phase with a deep belief network, followed by a supervised back-propagation neural network phase. Hanson et al. [23] implemented deep bidirectional long short-term memory recurrent neural networks for protein intrinsic disorder prediction. Salaken et al. [24] designed an autoencoder for the classification of pathological types of lung cancers. Geman et al. [25] proposed the application of deep neural networks for the analysis of large amounts of data produced by the human microbiome. Chen et al. [26] developed an incremental RNN to discriminate between benign and malignant breast cancers.

3 Deep Learning Methods

In this section, the deep neural networks that will be further used in the experiments are briefly described.

3.1 Multilayer Perceptron

The multilayer perceptron (MLP) constitutes one of the most conventional neural network architectures. It is commonly based on three layers: input, output, and one hidden layer. Nevertheless, MLPs can also be turned into deep neural networks by incorporating more than two hidden layers into their architecture; this makes it possible to reduce the number of nodes per layer and use fewer parameters, but in turn it leads to a more complex optimization problem [1, 25].

In deep MLP networks, each layer is trained on a different set of features based on the output of the previous layer: features selected in one layer are passed on as inputs for the training of the next layer.
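As an illustration, the following is a minimal sketch of such a deep MLP written with the Keras API; the input dimensionality, layer sizes, and activations are illustrative assumptions rather than the exact configuration reported in Table 2.

```python
# Minimal sketch of a deep MLP (Keras). The input dimensionality,
# layer sizes, and activations are illustrative assumptions; they do
# not reproduce the exact settings of Table 2.
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(n_features, n_classes, hidden_units=(128, 64)):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for units in hidden_units:
        # Each hidden layer learns features from the output of the
        # previous layer.
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical microarray-like problem: many features, two classes.
model = build_mlp(n_features=7129, n_classes=2)
```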

3.2 Recurrent Neural Network

Recurrent neural networks are a type of network for sequential data processing, able to scale to very long and variable-length sequences [1]. In this type of network, a neuron is connected to the neurons of the next layer, to those of the previous layer, and to itself by means of weights whose values change at each time step.

Recurrent neural networks can adopt different forms depending on the particular design:

  • Networks that produce an output at each time step, with recurrent connections between the hidden units.

  • Networks that produce an output and have recurrent connections only from the output to the hidden units of the next time step.

  • Networks with recurrent connections between hidden units that read the complete data sequence and produce a single output.

A design that improves recurrent neural networks is based on LSTM units, which solve the vanishing gradient problem that occurs in a conventional recurrent network. The gradient indicates how the weights should change with respect to the change in the error; if the gradient vanishes, it is not possible to adjust the weights in the direction that decreases the error, which causes the network to stop learning. This happens because the processed data go through many stages of multiplication.

Figure 1 shows the structure of a recurrent neural network working with LSTM cells, where x denotes the inputs, y the outputs, and s the values taken by the cells. Unlike the bidirectional recurrent neural network, which works with both forward and backward propagation (see Fig. 2), the recurrent neural network works only with forward propagation.

Fig. 1. Recurrent neural network with LSTM

An LSTM contains information in a gated cell independent of the flow of the neural network. This information can be stored, written, or read, which helps to preserve the error as it is propagated back through the layers. If the error remains constant, the network can continue learning over time. The LSTM cell decides when to store, write, or erase by means of gates that open and close analogically and act on signals; this allows the weights to be adjusted by gradient descent or the error to be back-propagated again [27].

The basic idea of the LSTM is very simple: some of the units, called constant error carousels, use an identity function as activation and have a self-connection with a fixed weight of 1.0 [2].
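A minimal sketch of a recurrent classifier built on LSTM cells is given below, again using the Keras API; treating each sample as a univariate sequence, as well as the parameter values, is an illustrative assumption.

```python
# Minimal sketch of a recurrent classifier with LSTM cells (Keras).
# Treating each sample as a univariate sequence of length seq_len is
# an illustrative assumption, as are the parameter values.
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(seq_len, n_classes, units=64):
    model = keras.Sequential()
    model.add(keras.Input(shape=(seq_len, 1)))
    # The LSTM gates decide when to store, write, or erase the cell
    # state, which keeps the back-propagated error from vanishing.
    model.add(layers.LSTM(units))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```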

3.3 Bidirectional Recurrent Neural Network

Bidirectional recurrent neural networks are a type of network in which one recurrent network is used with forward propagation and another with backward propagation. This type of network is used for input data sequences whose beginning and end are known (e.g., spoken sentences and protein structures). To capture the past and future context of each sequence element, one recurrent network processes the data sequence from beginning to end, and another processes it backward from end to beginning [2].
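Only the bidirectional wrapper distinguishes the following hedged sketch from the previous one: one LSTM reads the sequence forward and another backward, and their outputs are concatenated. The parameter values are again illustrative assumptions.

```python
# Sketch of a bidirectional recurrent classifier (Keras). The
# Bidirectional wrapper runs one LSTM forward over the sequence and
# another backward, then concatenates their outputs.
from tensorflow import keras
from tensorflow.keras import layers

def build_brnn(seq_len, n_classes, units=64):
    model = keras.Sequential()
    model.add(keras.Input(shape=(seq_len, 1)))
    model.add(layers.Bidirectional(layers.LSTM(units)))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```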

Fig. 2. Bidirectional recurrent neural network with LSTM

3.4 Autoencoder

An autoencoder is a type of neural network that copies its input to its output. It consists of an encoder that maps the input to an internal representation and a decoder that reconstructs the input from that representation. In general, it can be used for feature selection, dimensionality reduction, and classification [1].

There are different types of autoencoders, which can perform different tasks depending on their structure:

  • Undercomplete autoencoder: it learns useful features by restricting the code h to fewer dimensions than the input x, where h denotes the nodes of the encoder and x the inputs (see the sketch after this list).

  • Regularized autoencoder: it uses a loss function that encourages properties other than merely copying the input to the output.

  • Sparse autoencoder: a sparsity penalty is applied during training; it is used to learn features for classification tasks.

  • Denoising autoencoder: it obtains useful features by minimizing the reconstruction error; it receives a corrupted data set and is trained to predict the original, uncorrupted data as output.
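As an illustration of the first type, the following is a minimal sketch of an undercomplete autoencoder in Keras, where the code h is restricted to fewer dimensions than the input x; all sizes are illustrative assumptions.

```python
# Minimal sketch of an undercomplete autoencoder (Keras): the code h
# has fewer dimensions than the input x, forcing the encoder to learn
# useful features. All sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 7129  # hypothetical input dimensionality
code_size = 64     # h is restricted to fewer dimensions than x

inputs = keras.Input(shape=(n_features,))
h = layers.Dense(code_size, activation="relu")(inputs)      # encoder
outputs = layers.Dense(n_features, activation="linear")(h)  # decoder

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# The training target equals the input: autoencoder.fit(X, X, ...).
# A denoising variant would instead call fit(X_corrupted, X, ...).
```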

4 Experimental Set-Up

The purpose of the experiments in this work is to compare some state-of-the-art machine learning algorithms with deep learning for the classification of gene-expression microarrays. To this end, a collection of publicly available microarray cancer data sets taken from the Kent Ridge Biomedical Data Set Repository (http://datam.i2r.a-star.edu.sg/datasets/krbd) was used (see Table 1).

Table 1. Description of the data sets. The imbalance ratio (IR), which corresponds to the ratio of the majority class size to the minority class size, is reported in the last column

For the experimental design, the holdout method was repeated 10 times, with 70% of the samples for training and 30% for testing. The traditional machine learning methods used in these experiments were the radial basis function (RBF) neural network, the random forest (RNDF), the nearest neighbor (1NN) rule, the C4.5 decision tree, and a support vector machine (SVM) using a linear kernel function with the soft-margin constant \(C=1.0\) and a tolerance of 0.001. The deep learning models analyzed in this work were the recurrent neural network (RNN), the bidirectional recurrent neural network (BRNN), and the autoencoder (AE). In addition, we included two versions of MLP: one with two hidden layers (MLP2) and one with three hidden layers (MLP3). The main parameters of the deep neural networks are listed in Table 2.

Table 2. Parameters of the deep neural networks

The state-of-the-art machine learning methods were applied using the default parameters as defined in the WEKA data mining toolkit [28].
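For illustration purposes, the following sketch reproduces the repeated holdout protocol with scikit-learn; the stratified splitting and the placeholder arrays X and y are assumptions, while the SVM settings (linear kernel, C = 1.0, tolerance of 0.001) follow the text.

```python
# Sketch of the repeated holdout protocol: 10 random 70/30 splits,
# reporting the mean accuracy and standard deviation. The SVM settings
# follow the text; stratified splitting is an assumption.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def repeated_holdout(X, y, n_repeats=10, test_size=0.30):
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)
        clf = SVC(kernel="linear", C=1.0, tol=0.001)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))
```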

5 Results

Table 3 reports the accuracy results and standard deviations for each classifier and each database. In addition, the Friedman average rankings are also included. Bold values indicate the best model for each data set.

Table 3. Accuracy results (and standard deviation) for the classifiers
Table 4. Wilcoxon’s paired signed-rank test (\(\alpha = 0.05\))

From the Friedman rankings, one can see that the best algorithms were MLP2 and AE, followed by the classical random forest, whereas the two versions of recurrent neural networks (RNN and BRNN) performed the worst on average. When focusing on the accuracy results for each particular database, the autoencoder was the best method in four out of the eight problems (Lung-Michigan, Lung-Ontario, Ovarian, and Colon), and the MLP2 model was the best performing algorithm in two cases (Prostate and Breast).

It is worth noting that Lung-Michigan, Lung-Ontario and Ovarian, which correspond to three of the databases where the AE method performed the best, are the cases with the highest imbalance ratio as reported in Table 1. On the other hand, the only problem where a state-of-the-art machine learning method achieved the best accuracy was CNS, which is one of the databases with the smallest number of samples and features.

To check the results of the classifiers and to determine whether or not there exist significant differences between each pair of algorithms, the Wilcoxon paired signed-rank test at a significance level of \(\alpha = 0.05\) was employed. This test ranks the differences in performance of two algorithms for each data set, ignoring the signs, and compares the ranks of the positive and the negative differences. Table 4 shows the results of this test, where the symbol “\(\bullet \)” indicates that the classifier in the column was significantly better than the classifier in the row, whereas the symbol “\(\circ \)” indicates that the classifier in the row performed significantly better than the classifier in the column.
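A sketch of this pairwise comparison with SciPy is shown below; the accuracy values are hypothetical placeholders, one value per data set for each of the two algorithms being compared.

```python
# Sketch of the pairwise Wilcoxon signed-rank comparison (SciPy). The
# accuracy vectors are hypothetical placeholders, one per data set.
from scipy.stats import wilcoxon

acc_a = [0.95, 0.88, 0.91, 0.97, 0.84, 0.90, 0.79, 0.93]
acc_b = [0.92, 0.85, 0.93, 0.95, 0.80, 0.88, 0.77, 0.90]

# The test ranks the absolute differences between the paired accuracies
# and compares the ranks of the positive and negative differences.
stat, p_value = wilcoxon(acc_a, acc_b)
print(f"W = {stat}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```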

6 Conclusions

In this paper, we have carried out an empirical comparison between several deep neural networks and some traditional machine learning methods for the classification of gene-expression microarray data, which are characterized by a very large number of features and a small number of samples. While deep learning has been demonstrated to be a powerful tool in applications with a huge amount of both samples and features, there has been no study of problems that suffer from the “curse of dimensionality” phenomenon, as is the case of gene-expression microarray analysis.

The experimental results have shown that the autoencoder and an MLP with two hidden layers were the best performing deep neural networks. On the other hand, it has also been observed that there is no single method with the highest accuracy on all databases; even the SVM (a traditional machine learning algorithm) was superior to the remaining models on one problem. Another interesting finding is that the recurrent neural networks were the worst techniques on average.