1 Introduction

Recent studies in biology have increasingly shown that ncRNAs play vital roles in complex cellular processes such as cell proliferation and differentiation [1], chromatin modification [2], and apoptosis [3]. At the same time, with the development of modern science and technology, a large number of ncRNAs have been discovered whose functions are not yet exactly known [4]. It is therefore urgent to clarify the functions of these ncRNAs. To do so, researchers need to identify whether a given ncRNA can interact with proteins in particular biological processes [5,6,7,8,9,10,11]. However, current prediction methods still have shortcomings and room for improvement. Extracting feature information from sequences is therefore a necessary step, and a large body of research has shown that it can effectively identify interactions between ncRNAs and proteins [12,13,14,15,16,17,18].

In this study, we put forward a sequence-based method that combines a deep learning stacked auto-encoder network with a Random Forest (RF) classifier. We used k-mer sparse matrices to represent RNA sequences and then extracted a feature vector from each matrix by Singular Value Decomposition (SVD). A Position Specific Scoring Matrix (PSSM) was used to obtain evolutionary information from each protein sequence, and a bi-gram model was further used to derive a feature vector from the PSSM. The features and labels were then fed into the RF classifier to decide whether a given protein-ncRNA pair interacts. To evaluate the performance of our approach, five-fold cross-validation and two widely used datasets, RPI1807 and RPI2241, were used. The experimental results show that our method achieves high accuracy and robustness on the protein-ncRNA interaction prediction task.

2 Materials and Methods

2.1 Datasets

We carried out experiments on two widely used public datasets, RPI1807 and RPI2241. The RPI1807 dataset consists of 1807 positive ncRNA-protein interaction pairs (involving 1078 RNA chains and 1807 protein chains) and 1436 negative pairs (involving 493 RNA chains and 1436 protein chains) [19]. It was established by parsing the Nucleic Acid Database (NDB), which provides RNA-protein complex data, together with a protein-RNA interface database. The RPI2241 dataset was constructed in a similar way and contains 2241 interacting RNA-protein pairs.

2.2 Features Extraction

To extract features from ncRNA sequences, we used the k-mer sparse matrix approach. A two-dimensional matrix is used to store the features of an ncRNA, which can express much more useful and significant information, such as frequency and position information, than the raw sequence [20]. An input ncRNA sequence of length L is converted into a \( 4^{k} \times \left( {L - k + 1} \right) \) matrix M, defined as follows.

$$ M = \left( {a_{ij} } \right)_{{4^{k} \times \left( {L - k + 1} \right)}} $$
(1)
$$ a_{ij} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if\,m_{j} m_{j + 1} \cdots m_{j + k - 1} = k{\text{-}}mer\left( i \right)} \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(2)
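As an illustration of equations (1)-(2), the following sketch (our own, not the authors' code; the helper name `build_kmer_matrix` and the default k = 4 are assumptions) builds the binary k-mer occurrence matrix M with NumPy:

```python
from itertools import product
import numpy as np

def build_kmer_matrix(seq, k=4, alphabet="ACGU"):
    """Build the 4^k x (L - k + 1) binary k-mer occurrence matrix M.

    M[i, j] = 1 if the k-mer starting at position j of the ncRNA
    sequence equals the i-th k-mer in lexicographic order, else 0.
    """
    kmer_index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    L = len(seq)
    M = np.zeros((len(alphabet) ** k, L - k + 1), dtype=np.int8)
    for j in range(L - k + 1):
        kmer = seq[j:j + k]
        if kmer in kmer_index:          # skip ambiguous bases such as 'N'
            M[kmer_index[kmer], j] = 1
    return M

# Example: a toy ncRNA sequence
M = build_kmer_matrix("AUGGCUACGUUAGC", k=4)
print(M.shape)   # (256, 11)
```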

After obtaining the corresponding two-dimensional matrix from the original ncRNA sequence, we compress this large matrix by means of singular value decomposition (SVD) to obtain a compact feature vector [21].
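A minimal sketch of this step, assuming the leading singular values of M are kept as the ncRNA feature vector (the paper does not specify how many components are retained, so `n_features` is our placeholder):

```python
import numpy as np

def svd_features(M, n_features=16):
    """Compress the k-mer sparse matrix with SVD and keep the leading
    singular values as a fixed-length ncRNA feature vector."""
    singular_values = np.linalg.svd(M.astype(float), compute_uv=False)
    feat = np.zeros(n_features)
    n = min(n_features, singular_values.size)
    feat[:n] = singular_values[:n]   # zero-pad short sequences
    return feat

# rna_feature = svd_features(M)   # M from the k-mer sketch above
```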

Similarly, we extracted protein features from the PSSM computed from the original protein sequence rather than from the sequence directly, since not all amino acid combinations can be observed in a single protein sequence [22]. To extract features useful for recognizing the protein fold, we applied a bi-gram feature extraction technique computed from the information contained in the PSSM [23].

The bi-gram occurrence matrix B can be calculated as follows, where \( b_{m,n} \) is the element of B:

$$ B = \left\{ {b_{m,n} ,1 \le m \le 20,1 \le n \le 20} \right\} $$
(3)
$$ b_{m,n} = \sum\nolimits_{i = 1}^{r - 1} {p_{i,m} p_{i + 1,n} } ,\;1 \le m \le 20,\;1 \le n \le 20 $$
(4)

where \( b_{m,n} \) can be interpreted as the occurrence probability of the transition from the \( m \)th amino acid to the \( n \)th amino acid, computed from the elements \( p_{i,j} \) of the PSSM, and r is the number of rows of the PSSM (the length of the protein sequence) [24]. Let F be the bi-gram feature vector used for protein fold recognition, defined as follows:

$$ F = \left\{ {b_{1,1} ,b_{1,2} , \cdots ,b_{1,20} ,b_{2,1} , \cdots ,b_{2,20} , \cdots ,b_{20,1} , \cdots ,b_{20,20} } \right\}^{T} $$
(5)

where the symbol T denotes the transpose of the feature vector [25]. Random Forest classifiers were then used to predict the interaction between an ncRNA and a protein.
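The sketch below (our own illustration; the helper name `bigram_features` is hypothetical) computes B and flattens it into the 400-dimensional vector F of equation (5), assuming `pssm` is an r x 20 array of PSSM elements \( p_{i,j} \), typically already normalized to probabilities:

```python
import numpy as np

def bigram_features(pssm):
    """Compute the 20 x 20 bi-gram occurrence matrix B from an r x 20
    PSSM (eq. 4) and return it flattened as the vector F (eq. 5)."""
    pssm = np.asarray(pssm, dtype=float)
    r = pssm.shape[0]
    # B[m, n] = sum_{i=1}^{r-1} pssm[i, m] * pssm[i+1, n]
    B = pssm[:r - 1].T @ pssm[1:]
    return B.flatten()            # F has length 20 * 20 = 400

# protein_feature = bigram_features(pssm)   # pssm: r x 20 matrix from PSI-BLAST
```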

2.3 Deep Learning Framework Based on Stacked Autoencoder

To improve predictive accuracy, much recent research has focused on auto-encoders and deep learning networks [26,27,28,29,30,31,32]. In this study, we used a stacked auto-encoder network to learn deep representations of the training data and obtain an efficient deep learning model [33]. A complete stacked auto-encoder network consists of several sparse auto-encoder layers in which the input of each layer is the output of the previous layer [34]. Through hyperparameter optimization, we obtained the parameters of the stacked auto-encoder network best suited to our learning model [35]. The sparse auto-encoder used to learn feature transformations is a single-layer auto-encoder defined as follows:

$$ p_{{\left( {\alpha ,\beta } \right)}} \left( x \right) = f\left( {\alpha^{T} x + \beta } \right) = f\left( {\sum\nolimits_{i = 1}^{n} {\alpha_{i} x_{i} } + \beta } \right) $$
(6)

where the input x is an n-dimensional feature vector and \( f\left( \cdot \right) \) is the activation function. The auto-encoder network maps x to the output \( p\left( x \right) \). The sigmoid function was selected as the activation function:

$$ f\left( y \right) = \frac{1}{{1 + e^{ - y} }} $$
(7)

Consequently, the loss function is defined as follows:

$$ H\left( {X,\alpha } \right) = \left\| {\alpha p - X} \right\|^{2} + \omega \sum\nolimits_{j} {\left| {p\left( j \right)} \right|} $$
(8)

The stacked network architecture is composed of multiple neural network layers in which the output of each layer serves as the input of the next [36]. The Keras library was used to implement the stacked auto-encoder, with the parameters batch_size and nb_epoch both set to 100 [37]. Details about Keras can be found at http://github.com/fchollet/keras.
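As a hedged illustration of this setup (the paper fixes only batch_size and the number of epochs at 100; the layer sizes, optimizer, L1 sparsity weight, and the use of tf.keras instead of the original standalone Keras are our assumptions), a stacked sparse auto-encoder could be built roughly as follows:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_stacked_autoencoder(input_dim, hidden_dims=(256, 128, 64)):
    """Stacked sparse auto-encoder: each hidden layer takes the output of
    the previous one; an L1 activity penalty enforces sparsity (cf. eq. 8)."""
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for dim in hidden_dims:                       # encoder
        x = layers.Dense(dim, activation="sigmoid",
                         activity_regularizer=regularizers.l1(1e-4))(x)
    encoded = x
    for dim in reversed(hidden_dims[:-1]):        # decoder mirrors the encoder
        x = layers.Dense(dim, activation="sigmoid")(x)
    outputs = layers.Dense(input_dim, activation="sigmoid")(x)
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# X: ncRNA + protein feature vectors stacked row-wise, scaled to [0, 1]
# autoencoder, encoder = build_stacked_autoencoder(X.shape[1])
# autoencoder.fit(X, X, batch_size=100, epochs=100)   # nb_epoch in older Keras
# deep_features = encoder.predict(X)                  # fed to the RF classifier
```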

2.4 Stacked Ensemble

To integrate the individual outputs of the classifiers into a multi-classifier ensemble and obtain an approximately optimal objective function [8, 38,39,40], we treated the outputs of all level 0 classifiers as predicted probability scores and used a logistic regression classifier as the level 1 combiner. The experiments show that stacked ensembling reduces to simply averaging the individual model outputs when the logistic regression assigns the same weight to every level 0 classifier.

$$ P_{w} \left( {y = \pm 1\left| s \right.} \right) = \frac{1}{{1 + e^{{ - yw^{T} s}} }} $$
(9)

where s is the vector of predicted probability scores output by the level 0 classifiers and w is the corresponding weight vector [41].
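A minimal sketch of this level 1 combination, assuming `level0_scores` is an (n_samples, n_classifiers) array of out-of-fold probability scores from the level 0 classifiers (the variable names and the scikit-learn implementation are our choices):

```python
from sklearn.linear_model import LogisticRegression

def fit_level1(level0_scores, y):
    """Level 1 stacker: logistic regression over level 0 probability scores.
    When all learned weights are equal, this reduces to simple score averaging."""
    meta = LogisticRegression()
    meta.fit(level0_scores, y)   # y: interaction labels in {0, 1}
    return meta

# meta = fit_level1(level0_train_scores, y_train)
# ensemble_prob = meta.predict_proba(level0_test_scores)[:, 1]
```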

3 Experimental Results

The five-fold cross-validation method was used to evaluate the performance of our approach; it randomly divides the whole dataset into five equal parts [42,43,44,45]. We followed widely used evaluation measures, including accuracy, sensitivity, specificity, precision and AUC [46,47,48,49,50]. The experimental results on the RPI1807 and RPI2241 datasets are shown in Table 1.
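The sketch below illustrates how such a five-fold evaluation could be run with scikit-learn (the RF hyper-parameters and the feature matrix `X` are assumptions; sensitivity and specificity are derived from the confusion matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def evaluate_5fold(X, y, n_trees=500, seed=0):
    """Five-fold cross-validation reporting accuracy, sensitivity,
    specificity, precision and AUC, averaged over the folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        pred = (prob >= 0.5).astype(int)
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
        scores.append({
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),
            "auc":         roc_auc_score(y[test_idx], prob),
        })
    return {k: np.mean([s[k] for s in scores]) for k in scores[0]}
```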

Table 1. The experimental results on RPI1807 and RPI2241.

Dataset    Accuracy   Sensitivity   Specificity   Precision   AUC
RPI1807    0.9600     0.9344        0.9989        0.9117      0.9920
RPI2241    0.9130     0.8772        0.9660        0.8590      0.9470

According to Table 1, our method achieved strong performance, with an accuracy of 0.9600, sensitivity of 0.9344, specificity of 0.9989, precision of 0.9117 and AUC of 0.9920 on the RPI1807 test set, and an accuracy of 0.9130, sensitivity of 0.8772, specificity of 0.9660, precision of 0.8590 and AUC of 0.9470 on the RPI2241 test set.

4 Conclusions

In this study, we proposed a sequence-based method that combines a deep learning stacked auto-encoder network with an RF classifier. By employing the k-mer sparse matrix and the bi-gram algorithm, representative ncRNA and protein features were extracted from the corresponding sequence information. In our experiments, the method showed satisfying performance for predicting RPIs on each reference dataset, thanks largely to the stacked ensemble auto-encoder framework. In general, our method extracts protein features and automatically learns higher-level features that are then classified with Random Forests, but it does not yet represent a breakthrough from a biological perspective. In future research, we expect to design better network architectures for extracting hidden high-level features that are biologically interpretable.