1 Introduction

1.1 Motivation

Long noncoding RNAs (lncRNAs) are a class of long endogenous noncoding RNAs with poor sequence conservation [1,2,3]. lncRNAs are closely associated with multiple key biological processes [4]. More importantly, a growing body of work implies that lncRNAs are also densely linked with many complex diseases [5, 6], for example, brachydactyly syndrome and HELLP syndrome [7], facioscapulohumeral muscular dystrophy [8], obesity [9], and cancers. For example, the lncRNAs HOXA-AS2 and SNHG12 have been identified as possible therapeutic targets and biomarkers in human cancers [10, 11], DLEU1 is densely linked with colorectal cancer progression through the activation of KPNA3 [12], HOTAIR expression is elevated in lung cancer [13], and ZFAS1 is closely related to cervical cancer cell chemosensitivity [14]. In summary, lncRNAs have been increasingly confirmed to be tumor-related biological molecules. However, to date, relationships between lncRNAs and known tumor-suppressive entities remain largely elusive. Evidence indicates that lncRNAs exert their biological functions through linkages with RNA-binding proteins. Therefore, the identification of potential lncRNA–protein interactions (LPIs) contributes to understanding many important biological processes as well as the progression and metastasis of various complex diseases.

1.2 Related Work

Wet-lab experiments for LPI identification are time-consuming and resource-intensive. Computational methods have therefore been gradually explored for potential LPI discovery. Existing computation-based LPI prediction methods can be roughly categorized into network-based techniques and machine learning-based techniques. Network-based methods generally construct several lncRNA/protein-related networks and then design a network algorithm to compute the probabilities of interactions between lncRNAs and proteins. Zhao et al. [15] and Ge et al. [16] designed two bipartite network-based recommendation algorithms to score each lncRNA–protein pair. Zhou et al. [17] proposed a similarity kernel fusion method for LPI prediction (LPI-SKF). Zheng et al. [18] fused multiple protein similarity networks to uncover potential associations between lncRNAs and proteins.

Machine learning-based methods select features for lncRNAs and proteins to describe an lncRNA–protein pair and use the extracted features as input to train a supervised learning model for possible LPI identification. These methods include matrix factorization-based models, ensemble learning-based models, and deep learning-based models. To discover new LPIs, Liu et al. [19], Zhang et al. [20], and Ma et al. [21] explored a neighborhood regularized logistic matrix factorization method, a graph regularized nonnegative matrix factorization model, and a projection-based neighborhood nonnegative matrix decomposition method (PMKDN), respectively.

Ensemble learning-based techniques have been widely applied to LPI identification. Hu et al. [22] presented a unified framework combining support vector machines, random forests, and extreme gradient boosting. Zhang et al. [23] designed a feature projection ensemble learning-based framework (SFPEL). Deng et al. [24] extracted lncRNA and protein information, including HeteSim features and diffusion features, and constructed a gradient tree boosting algorithm (PLIPCOM). Fan et al. [25] explored a broad learning system-based ensemble classification model. Wekesa et al. [26] exploited a categorical boosting approach (LPI-CatBoost). Yi et al. [27] proposed a stacking ensemble learning algorithm.

Deep learning architectures can better learn hidden information in raw data and characterize data in each layer through nonlinear transformations [28]. Therefore, deep learning has become a research hotspot in bioinformatics [6, 29,30,31]. Deep learning also shows broad application in LPI prediction, such as the works provided by [32,33,34,35]. Deng et al. [32] proposed a deep neural network for predicting binding sites of RNA-binding proteins. Wei et al. [35] fused biological feature blocks via a Deep Neural Network (DNN). Zhang et al. [33] presented an ensemble deep learning model for identifying interacting biomolecule types for lncRNAs. Wekesa et al. [34] explored a graph attention-based deep learning model to predict plant LPIs. Zhao et al. [36] developed a graph convolutional network-based method to prioritize target protein-coding genes of lncRNAs. Shaw et al. [37] exploited a multimodal deep learning model to identify relationships between lncRNAs and protein isoforms.

Computational methods have effectively discovered many potential associations between lncRNAs and proteins. However, network-based techniques fail to find possible proteins/lncRNAs for an orphan lncRNA/protein. Machine learning-based LPI prediction approaches still have the following problems to solve. First, most methods are evaluated on a single dataset, which may result in prediction bias. Second, the majority of methods are validated only under Cross Validation (CV) on lncRNA–protein pairs, ignoring the performance under other CV settings, for example, CVs on lncRNAs or proteins. Finally, features of lncRNAs and proteins require further integration. The details are summarized in Table 1.

Table 1 Summarization of existing studies and the proposed method

1.3 Study Contributions

In this manuscript, an ensemble learning framework (EnANNDeep) is developed to quantify the interplays between lncRNAs and proteins. EnANNDeep integrates diverse biological information, an adaptive k-nearest neighbor (AkNN) classifier, a deep neural network (DNN), a deep forest, and ensemble learning theory into a unified framework. The work makes the following three contributions:

  1. An ensemble learning framework, composed of the AkNN algorithm, a DNN, and a deep forest, is exploited to accurately learn the labels of unknown lncRNA–protein pairs.

  2. The proposed AkNN classification model separately selects the right k for each neighborhood and provides an upper bound on the failure probability.

  3. Deep models, including the DNN and the deep forest, better represent the biological features of each lncRNA–protein pair.

2 Materials and Methods

2.1 Data Preparation

In this study, five different LPI-related datasets are assembled. Table 2 shows the details of the five datasets. Datasets 1, 2, and 3 contain human LPI data, and datasets 4 and 5 contain plant LPI data. Dataset 1 was provided by Li et al. [38]. We obtain 3479 interactions between 935 lncRNAs and 59 proteins after removing lncRNAs and proteins whose sequence information is unknown in NPInter [39], NONCODE [40], and UniProt [41].

Dataset 2 was built by Zheng et al. [18]. We screen 3265 interactions between 885 lncRNAs and 84 proteins after preprocessing similar to that of dataset 1. Dataset 3 was constructed by Zhang et al. [42] and contains 4158 interactions between 990 lncRNAs and 27 proteins.

Datasets 4 and 5 were derived from Arabidopsis thaliana and Zea mays, respectively. The former contains 948 interactions between 109 lncRNAs and 35 proteins, and the latter provides 22,133 associations between 1704 lncRNAs and 42 proteins. Sequence data are extracted from the PlncRNADB database [43] and interaction data are obtained at http://bis.zju.edu.cn/PlncRNADB/.

We represent the LPI network as a matrix Y whose elements are defined by Eq. (1):

$$\begin{aligned} y_{ij} = \left\{ \begin{array}{ll} 1, &{} \quad \text {if lncRNA } l_i \text { interacts with protein } p_j\\ 0, &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$
(1)
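For concreteness, the following is a minimal sketch of how Y of Eq. (1) might be assembled from a list of known interactions; the function and identifier formats are illustrative, not part of the original pipeline:

```python
import numpy as np

def build_interaction_matrix(pairs, lncrnas, proteins):
    """Assemble the binary LPI matrix Y of Eq. (1) from known interactions.
    pairs: iterable of (lncRNA_id, protein_id); lncrnas/proteins: ordered id lists."""
    row = {l: i for i, l in enumerate(lncrnas)}
    col = {p: j for j, p in enumerate(proteins)}
    Y = np.zeros((len(lncrnas), len(proteins)), dtype=np.int8)
    for l, p in pairs:
        Y[row[l], col[p]] = 1  # y_ij = 1 iff lncRNA l_i interacts with protein p_j
    return Y
```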
Table 2 The statistics of LPI data

2.2 Overview of EnANNDeep

In this study, we develop an ensemble learning framework (EnANNDeep), composed of AkNN, DNN, and deep forest, to classify unknown lncRNA–protein pairs. Figure 1 describes the EnANNDeep framework.

As shown in Fig. 1, EnANNDeep mainly contains three procedures after the five LPI datasets are assembled. (1) Feature selection: an ensemble method combining gapped k-mer [44], tri-nucleotide composition [45], reverse complement k-mer [46], and RNAfold [47] is used for lncRNA feature selection; SSpro [48] and a binary profile are used to choose protein features. (2) Classification: AkNN, DNN, and deep forest are each exploited to obtain labels of unknown lncRNA–protein pairs. (3) Ensemble: the results from the above three predictors are integrated based on a soft voting technique.

Fig. 1

The flowchart of the EnANNDeep framework: (1) Feature selection; (2) Classification; (3) Ensemble

2.3 Feature Selection

2.3.1 lncRNA Feature Selection

The integration of various lncRNA and protein features contributes to improving LPI prediction accuracy. In this work, an ensemble approach is explored to represent lncRNA features. Given an lncRNA sequence L of length a, where \({l_i} \in \{A, C, G, T\}\) for \(i = 1, 2, \ldots , a\), EnANNDeep utilizes gapped 3-mer [44], tri-nucleotide composition [45], reverse complement 2-mer [46], and RNAfold [47] to characterize the lncRNA.

The tri-nucleotide composition technique is used to obtain evolutionary features from L. The tri-nucleotide compositions are extracted by scanning the sequence with overlapping windows \(\{(1,2,3), (2,3,4), \ldots , (a-2, a-1, a)\}\), where i denotes the position of the i-th nucleotide in L.

The gapped 3-mer method applies 3-mers with gaps to obtain local and global information from L. Let b represent the number of non-gapped positions in a 3-mer, so that the number of gaps is \(g=3-b\). A feature vector of L can be denoted by Eq. (2):

$$\begin{aligned} {f} = {[u_1,u_2,\ldots ,u_M]^T,} \end{aligned}$$
(2)

where \(u_{i}\) is the count of the i-th gapped 3-mer in L and M is the number of all gapped 3-mers, \(M = \left( {\begin{array}{*{20}{c}} 3\\ b \end{array}} \right) {4^b}\).

The reverse complement 2-mer method is used to extract regulatory features from L. First, all 2-mers are generated. Second, 2-mers that are reverse complements of already-counted 2-mers are eliminated. Finally, the occurrence frequencies of the remaining 2-length subsequences are calculated to build an lncRNA feature vector.

RNA secondary structures have been validated to positively affect protein binding site selection. A dynamic programming technique, RNAfold, is used to infer RNA secondary structures according to their minimum free energy. Five features corresponding to the highest-probability structures are extracted by counting the occurrence frequency of each unique structure.
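The following is a minimal sketch of two of these sequence descriptors, tri-nucleotide composition and reverse complement 2-mer, assuming sequences are given as DNA-alphabet strings (as is common for k-mer tools); it is an illustration, not the authors' exact implementation:

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMP = str.maketrans("ACGT", "TGCA")

def tri_nucleotide_composition(seq):
    """64-dim frequency vector over the windows (1,2,3), (2,3,4), ..., (a-2,a-1,a)."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = max(len(seq) - 2, 1)
    return [counts["".join(k)] / total for k in product(BASES, repeat=3)]

def reverse_complement_2mer(seq):
    """Frequencies of 2-mers after merging each 2-mer with its reverse complement;
    the 16 raw 2-mers collapse to 10 equivalence classes."""
    def canon(kmer):
        return min(kmer, kmer.translate(COMP)[::-1])
    classes = sorted({canon("".join(p)) for p in product(BASES, repeat=2)})
    counts = Counter(canon(seq[i:i + 2]) for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [counts[c] / total for c in classes]
```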

2.3.2 Protein Feature Selection

To depict a protein, first, its secondary structure is predicted in terms of \(\alpha\)-helix (H), \(\beta\)-sheet (E), and coil (C) conformations using SSpro [48]. Second, the 20 amino acids are divided into three categories based on the computed secondary structures: \(\alpha\)-helix contains eight amino acids (E, A, L, M, Q, K, R, and H), \(\beta\)-sheet contains seven amino acids (V, I, Y, C, W, F, and T), and coil contains five amino acids (G, N, P, S, and D). Third, each amino acid is replaced by its conformation, so each protein sequence can be represented as a string over H, E, and C; 27 distinct 3-tuples (\(3^3\)) can be formed from the three conformations. Fourth, a sliding 3-tuple window is applied to the replaced sequence and the number of each 3-tuple is counted. Finally, the occurrence frequency of each 3-tuple is calculated by Eq. (3):

$$\begin{aligned} {a_{i}} = \frac{{{d_{i}}}}{{{a-3+1}}} (i=1, 2, \ldots , 27), \end{aligned}$$
(3)

where \({d_i}\) is the number of occurrences of the i-th 3-tuple in the replaced protein sequence and a denotes the length of the protein sequence.

In addition, a binary profile describes the composition and order of residues in a protein sequence. In this study, a binary profile with \(20\times 16\) dimensions is produced based on a one-hot encoding of the 20 amino acids. The details of lncRNA and protein feature extraction are described in Table 3. Thus, an lncRNA–protein pair can be represented as a 554-dimensional vector \(\varvec{x}\) combining lncRNA and protein features.
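A minimal sketch of the 3-tuple frequency computation over the reduced {H, E, C} alphabet follows; the direct residue-to-group mapping implements the grouping above, while the handling of nonstandard residues is our assumption:

```python
from collections import Counter
from itertools import product

# Conformation groups of Sect. 2.3.2
GROUP = {**{aa: "H" for aa in "EALMQKRH"},   # alpha-helix
         **{aa: "E" for aa in "VIYCWFT"},    # beta-sheet
         **{aa: "C" for aa in "GNPSD"}}      # coil

def hec_triplet_frequencies(protein_seq):
    """27-dim frequency vector of 3-tuples over {H, E, C} (Eq. 3)."""
    reduced = "".join(GROUP.get(aa, "C") for aa in protein_seq)  # unknowns -> coil
    counts = Counter(reduced[i:i + 3] for i in range(len(reduced) - 2))
    total = max(len(reduced) - 2, 1)
    return [counts["".join(t)] / total for t in product("HEC", repeat=3)]
```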

Table 3 Numbers of the extracted lncRNA and protein features

2.4 Problem Description

Given an LPI training set \(D=(X,Y)\) with labels \(\{+1,-1\}\), X denotes the sample space, a separable metric space whose points are 554-dimensional feature vectors, and \(Y=\{+1,-1\}\) denotes the label space. A training example \(\varvec{x}\) is a 554-dimensional feature vector characterizing an lncRNA–protein pair, and \(\varvec{y}\in \{ +1, -1\}\) denotes its label. The label of \(\varvec{x}\) is \(+1\) when there is an interaction between the lncRNA and the protein, and \(-1\) otherwise. For any query lncRNA–protein pair \(\varvec{x}_i\), we aim to construct an ensemble model, EnANNDeep, to obtain its label.

2.5 Adaptive k-Nearest Neighbor

2.5.1 k-Nearest Neighbor

The k-Nearest Neighbor (k-NN) classifier [49] is a simple but effective classification model. It is well suited to classification tasks where prior knowledge about the data distribution is lacking. The classifier determines the label of a test sample based on the Euclidean distances between the test sample and all training samples.

Given n LPI samples, let \(\varvec{x}_i \,\,(i=1,2,\ldots ,n)\) denote the i-th sample with 554 features \(({x}_{i,1}, {x}_{i,2}, \ldots , {x}_{i,554})\). The Euclidean distance between two samples \(\varvec{x}_i\) and \(\varvec{x}_j\) is represented as:

$$\begin{aligned} d({x_i},{x_j}) = \sqrt{{{({x_{i,1}} - {x_{j,1}})}^2} + \cdots + {{({x_{i,554}} - {x_{j,554}})}^2}}. \end{aligned}$$
(4)

Based on the theory provided by Voronoi [50], a Voronoi cell \(R_i\) for sample \(\varvec{x}_i\) encapsulates all of its nearest neighbors and is defined by Eq. (5):

$$\begin{aligned} {R_i} = \{ \varvec{x}_a \in {{\mathbb {R}} ^p}: \, d(\varvec{x}_a,{\varvec{x}_i}) \le d(\varvec{x}_a,{\varvec{x}_m}),\,\forall m \ne i\}, \end{aligned}$$
(5)

where \(\varvec{x}_a\) ranges over all points (samples) within \(R_i\), that is, the points for which \(\varvec{x}_i\) is the nearest training example.

For any LPI sample, the k-NN classifier determines the nearest samples through the closest edges within the Voronoi cell \(R_i\). A test sample is assigned the majority category label of its k nearest training samples.

k-NN uses a fixed neighbor number k rather than a fixed radius, so it automatically adapts its effective radius to variation in the marginal distribution. Consequently, it has been broadly applied in various areas. However, the best choice of k depends severely on the features of each neighborhood and thus may vary greatly between different points. In regions of the input space where the conditional expectation \(\eta(\varvec{x})\) tends to 0, a larger k is required for accurate prediction. In regions where the conditional expectation is close to \(+1\) or \(-1\), a smaller k satisfies the requirement, and a larger k may result in incorrect classification due to the inconsistency of labels in neighboring regions. Thus the k-NN classifier has to select a single value of k to trade off the above two situations. To solve this problem, the AkNN classifier is designed to separately select the right k for each neighborhood.

2.5.2 Adaptive k-Nearest Neighbor

Inspired by the AkNN algorithm proposed by Balsubramani et al. [51], we design an AkNN algorithm to compute an interaction probability for each lncRNA–protein pair. Given a training set \((\varvec{x}_1, \varvec{y}_1), (\varvec{x}_2, \varvec{y}_2), \ldots , (\varvec{x}_n, \varvec{y}_n) \in {\mathcal {X}} \times {\mathcal {Y}}\), assume all LPI data are drawn independently and identically from an unobserved distribution P on \({\mathcal {X}} \times {\mathcal {Y}}\). Let \(\mu\) represent the marginal distribution on \({\mathcal {X}}\): if (X, Y) denotes a random draw from P and \(\eta (\varvec{x}) = \mathrm{E}(Y|X = \varvec{x})\), then for any measurable set \(S \subseteq {\mathcal {X}}\):

$$\begin{aligned} \mu (S) = \mathrm{Pr}(X \in S). \end{aligned}$$
(6)

For any given sample \(\varvec{x} \in {\mathcal {X}}\), the conditional expectation of Y can be denoted by Eq. (7):

$$\begin{aligned} \eta (\varvec{x}) = E(Y|X = \varvec{x}) \in [ - 1, \, 1]. \end{aligned}$$
(7)

For any S where \(\mu (S)>0\), given \(X \in S\), the conditional expectation of Y can be described by Eq. (8):

$$\begin{aligned} \eta (S) = E(Y|X \in S) = \frac{1}{{\mu (S)}}\int _S {\eta (\varvec{x})\mathrm{d}\mu (\varvec{x})}. \end{aligned}$$
(8)

Thus the error risk of a classifier \(g: {\mathcal {X}} \rightarrow \{-1, \, +1\}\) is the probability that it incorrectly classifies a random draw \((X,Y) \sim P\). The risk is denoted by Eq. (9):

$$\begin{aligned} R(g)=P(\{(\varvec{x},\varvec{y}):\,\, g(\varvec{x}) \ne \varvec{y}\}). \end{aligned}$$
(9)

For \(\varvec{x} \in {\mathcal {X}}\) and \(r>0\), let \(B(\varvec{x},r)\) represent the closed ball with radius r centered at \(\varvec{x}\):

$$\begin{aligned} B(\varvec{x},r)=\{\varvec{z} \in {\mathcal {X}}:d(\varvec{x},\varvec{z}) \le r\}. \end{aligned}$$
(10)

For a query lncRNA–protein pair \(\varvec{x}\), the AkNN classifier predicts its label based on the training lncRNA–protein pairs closest to \(\varvec{x}\). The empirical count of a set S is defined by Eq. (11):

$$\begin{aligned} \#_n(S)=|\{i: \varvec{x}_i \in S\}|. \end{aligned}$$
(11)

The probability mass can be described by Eq. (12):

$$\begin{aligned} \mu _n(S)=\frac{\#_n(S)}{n}. \end{aligned}$$
(12)

When the empirical count is non-zero, the empirical bias toward class y can be defined by Eq. (13):

$$\begin{aligned} {\eta _n}(S) = \eta _n^y(S) - \frac{1}{{|Y|}}, \end{aligned}$$
(13)

where n indicates the number of all training lncRNA–protein pairs, |Y| denotes the number of classes (in this manuscript, |Y| = 2), and \(\eta _n^y(S)\) is the empirical fraction of samples in S with label y:

$$\begin{aligned} \eta _n^y(S) = \frac{{{\# _n}\{ {\varvec{x}_i} \in S\,\, \mathrm{and}\,\,{\varvec{y}_i} = y\} }}{{{\# _n}(S)}}. \end{aligned}$$
(14)

The AkNN classification model is described in Algorithm 1. Based on Algorithm 1, the label of a query lncRNA–protein pair \(\varvec{x}\) is predicted by expanding a ball around \(\varvec{x}\) until the enclosed training labels produce a significant bias.

Algorithm 1 The adaptive k-nearest neighbor (AkNN) classification procedure

In Algorithm 1, \(\Delta (n,k,\delta )\) denotes the width of a confidence interval for the average label in the region closest to the query sample \(\varvec{x}\), and \(c_1\) represents a constant.

Algorithm 1 implicitly makes infinitely many parameter selections: it picks a k for each query point and asks for a single failure probability that governs how its confidence intervals are assigned. Compared with the standard k-NN classifier, the AkNN classification algorithm may seem to merely replace the parameter k with another parameter \(\delta\). However, this view is not accurate: \(\delta\), a customary confidence-level parameter, provides an upper bound on the failure probability of Algorithm 1.

To simplify the parameters, we replace \(\Delta (n,k,\delta ) = {c_1}\sqrt{\frac{{\log n + \log (1/\delta )}}{k}}\) with \(\Delta =\frac{A}{\sqrt{k}}\). The parameter A controls the conservativeness of Algorithm 1, and \(A \rightarrow 0\) denotes the most aggressive setting, in which Algorithm 1 never abstains. A detailed discussion is provided by Balsubramani et al. [51].
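A minimal sketch of this simplified rule follows; the Euclidean distance, the bias test against \(\Delta = A/\sqrt{k}\), and the abstention behavior track the description above, while the tie handling is our assumption:

```python
import numpy as np

def aknn_predict(X_train, y_train, x_query, A=1.0):
    """Simplified adaptive k-NN (Sect. 2.5.2): grow the neighborhood around the
    query until the empirical bias (Eq. 13) exceeds Delta = A / sqrt(k).
    Labels are in {-1, +1}; A controls conservativeness (A -> 0 never abstains).
    Returns (label, k); label 0 means the rule abstained on every neighborhood."""
    # Sort training pairs by Euclidean distance to the query, Eq. (4)
    order = np.argsort(np.linalg.norm(X_train - x_query, axis=1))
    for k in range(1, len(order) + 1):
        neighbors = y_train[order[:k]]
        for label in (-1, 1):
            bias = np.mean(neighbors == label) - 0.5  # eta_n^y(S) - 1/|Y|, |Y| = 2
            if bias > A / np.sqrt(k):                 # significant bias: stop growing
                return label, k
    return 0, len(order)
```

In the ensemble, the positive-label fraction of the selected neighborhood can be read off as the interaction probability \(S_{AkNN}\); this reading is our assumption.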

2.6 Deep Neural Network

The rapid development of machine learning models and computer hardware has promoted the emergence of DNNs. A DNN is a feed-forward artificial neural network consisting of one input layer, multiple hidden layers composed of nonlinear hidden units, and one output layer. The input layer receives the original data. Each hidden unit j in a hidden layer uses an activation function to map its input \(x_j\) from the preceding layer to a scalar state. The output layer produces the prediction.

DNNs have already been broadly applied to various association prediction tasks [28]. For example, Zhao et al. [52] identified drug–target interactions by combining a graph convolutional network and a DNN. Chu et al. [29] developed an optimized DNN to screen epidermal growth factor receptor inhibitors. Wang et al. [53] exploited a deep convolutional neural network-based drug–target interaction algorithm. Wei et al. [35] designed a DNN-based lncRNA–disease association prediction approach.

In this study, we utilize DNN to reveal possible LPIs. The DNN-based LPI prediction framework is shown in Fig. 2.

Fig. 2

The flowchart of DNN-based LPI prediction algorithm

In the DNN model, the input layer has 554 neurons and receives the input LPI samples with 554-dimensional features. The following two layers are hidden layers; they are fully connected layers containing 128 and 64 neurons, respectively. Each hidden layer is followed by a dropout layer with a rate of 0.5, which avoids over-fitting by setting the output of 50% of the units to 0. The Exponential Linear Unit (ELU) is used as the activation function in the hidden layers. ELU can alleviate gradient vanishing, pushes the average output of an activation unit closer to 0 (achieving an effect similar to batch normalization), and reduces computation time. For negative inputs, ELU saturates smoothly to a negative value, which makes the unit more robust to noise. More importantly, ELU contributes to faster learning and better generalization in DNNs. It is denoted by Eq. (15):

$$\begin{aligned} y_i = \left\{ \begin{array}{ll} a(e^{x_i} - 1), &{} \quad \text {if } x_i < 0\\ x_i, &{} \quad \text {if } x_i \ge 0 \end{array} \right. \end{aligned}$$
(15)

Our objective is to quantify how much the predicted labels differ from the real ones by minimizing the binary cross-entropy of Eq. (16) during training:

$$\begin{aligned} L = - \frac{1}{n}\sum \limits _{i = 1}^n {[{y_i}\log {{{\widehat{y}}}_i} + (1 - {y_i})\log (1 - {{{\widehat{y}}}_i})]}. \end{aligned}$$
(16)

where \({y_i}\) is the true label and \({\widehat{y}}_i\) denotes the predicted probability that the i-th sample is a positive LPI. Training runs for 100 epochs with mini-batches of size 128 to update the weights. We use the Adam algorithm [54] as the optimization technique to train the DNN.

The final output layer contains a single neuron to output an interaction probability for each query lncRNA–protein pair based on a sigmoid function defined by Eq. (17):

$$\begin{aligned} y_i=\frac{1}{1+e^{-x_i}}. \end{aligned}$$
(17)

The sigmoid function maps a real number to the interval (0, 1). It is smooth and easy to differentiate, and is thus used as the activation function in the output layer of the DNN to compute an interaction score for each lncRNA–protein pair.
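A minimal Keras sketch of this architecture follows, with the layer sizes, dropout rate, loss, and optimizer as stated above; labels are assumed to be recoded from \(\{-1, +1\}\) to \(\{0, 1\}\) for the cross-entropy loss:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_dim=554):
    """DNN of Fig. 2: 554 -> 128 -> 64 -> 1 with ELU hidden units (Eq. 15),
    dropout 0.5 after each hidden layer, and a sigmoid output (Eq. 17)."""
    model = keras.Sequential([
        layers.Dense(128, activation="elu", input_shape=(input_dim,)),
        layers.Dropout(0.5),
        layers.Dense(64, activation="elu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # interaction probability
    ])
    # Binary cross-entropy (Eq. 16) with the Adam optimizer [54]
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Training schedule stated above: model.fit(X, y, epochs=100, batch_size=128)
```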

2.7 Deep Forest

To tackle complicated tasks, learning models gradually go deep [55]. However, traditional deep models are almost always built upon neural networks, that is, multi-layered models composed of parameterized differentiable nonlinear modules. Non-neural-network-style deep models could demonstrate great learning ability if they were also able to go deep. Motivated by this observation, deep forest [56, 57], a non-neural-network-style deep model, is built upon a multi-grained cascade framework.

Deep forest is a novel ensemble algorithm. First, its feature learning capability is boosted by multi-grained scanning of the input data. Second, its model complexity is set automatically. Third, it performs well even on small-scale data. Finally, the training costs can be controlled according to the available computational resources. Deep forest needs far fewer hyper-parameters than other deep learning models; it therefore attains highly competitive classification ability while its training time drops sharply.

In this manuscript, a deep forest with no more than 20 layers is utilized to classify unobserved lncRNA–protein pairs. Random forests [58, 59] and Extra trees [60] are chosen as the basic classifiers. The random forest technique [58, 59] is a general-purpose, nonparametric, and interpretable classification model. It is an ensemble of randomized decision trees and can return measurements of variable importance. It has unique strengths in dealing with complex data structures, small sample sizes, and high-dimensional feature spaces. In particular, it demonstrates excellent performance when the number of variables far exceeds the number of samples.

The Extra trees model [60] is an ensemble of unpruned decision trees built with the classical top–down procedure. Extra trees has three advantages. First, it splits nodes by fully randomly selecting cut-points, which reduces variance more strongly than weaker randomization schemes. Second, it utilizes the whole learning sample rather than bootstrap replicas to minimize classification bias. Finally, its node splitting scheme yields a much smaller constant factor during cut-point optimization.

In the proposed deep forest model, each cascade layer consists of two random forests and two Extra trees, and each estimator consists of 100 decision trees. In each layer, for a given LPI feature vector, each classifier estimates the probability that the sample belongs to the positive class and the negative class. The predicted class probabilities from all classifiers form a class vector. This vector is concatenated with the raw LPI feature vector as the input to the next layer.

As illustrated in Fig. 3, a 554-dimensional vector is taken as the input of the deep forest. After training the four basic classifiers, an 8-dimensional class vector is produced and concatenated with the 554-dimensional vector to generate a 562-dimensional feature vector. The produced vector is used as the input of the second layer. Similar to the first layer, the second layer of the deep forest also generates a 562-dimensional vector for the third layer. If the estimated performance improves on all previously constructed layers, the deep forest continues to add a new layer. Training terminates when performance fails to improve over two successive layers. Finally, in the output layer, for each lncRNA–protein pair, the predicted interaction probabilities for the positive class and the negative class are each averaged over the classifiers, and the class with the higher average probability is chosen as the final class.

Fig. 3

The flowchart of deep forest

In particular, similar to a DNN, the deep forest utilizes a cascade structure in which each level receives the features output by its preceding level and passes its own results to the next level. Therefore, although the 8-dimensional class vector is a relatively small part of each layer's input, it is regenerated and re-concatenated at every level, so its information is repeatedly reinforced as the number of layers grows and cannot be drowned out in our model.
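A minimal sketch of the cascade follows; the two random forests, two Extra trees, 100 trees per estimator, the 8-dimensional class vector, and the two-layer stopping rule come from the description above, while the 3-fold out-of-fold probabilities, the use of training-set accuracy as the growth criterion, and labels in {0, 1} are our assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def cascade_forest(X_train, y_train, X_test, max_layers=20, patience=2):
    """Cascade-forest sketch (Fig. 3): each layer holds two random forests and two
    Extra trees (100 trees each); their 8 class probabilities are concatenated with
    the raw 554-dim features and passed to the next layer."""
    aug_train, aug_test = X_train, X_test
    best_acc, stall, test_prob = 0.0, 0, None
    for _ in range(max_layers):
        layer = [RandomForestClassifier(n_estimators=100) for _ in range(2)] + \
                [ExtraTreesClassifier(n_estimators=100) for _ in range(2)]
        train_probs, test_probs = [], []
        for clf in layer:
            # out-of-fold probabilities keep overfit predictions from leaking forward
            train_probs.append(cross_val_predict(clf, aug_train, y_train,
                                                 cv=3, method="predict_proba"))
            clf.fit(aug_train, y_train)
            test_probs.append(clf.predict_proba(aug_test))
        test_prob = np.mean(test_probs, axis=0)  # average over the four classifiers
        acc = np.mean(np.argmax(np.mean(train_probs, axis=0), axis=1) == y_train)
        if acc <= best_acc:
            stall += 1
            if stall >= patience:  # no gain over two successive layers: stop
                break
        else:
            best_acc, stall = acc, 0
        # 554 raw features + 8-dim class vector -> 562-dim input for the next layer
        aug_train = np.hstack([X_train] + train_probs)
        aug_test = np.hstack([X_test] + test_probs)
    return test_prob[:, 1]  # probability of the positive class
```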

2.8 Ensemble Learning

Ensemble learning achieves better prediction accuracy than a single model by training multiple classifiers and integrating their predictions [27, 61, 62]. Chen et al. [63] exploited a decision tree ensemble algorithm to uncover possible miRNA–disease associations. Zhang et al. [23] designed a sequence feature projection-based ensemble learning model to identify LPI candidates. Yi et al. [27] exploited a stacking ensemble learning algorithm to discover ncRNA–protein interactions.

Although AkNN, DNN, and deep forest can each effectively predict LPIs, their predictive performance leaves room for improvement. In this study, we present a soft voting-based ensemble learning framework, composed of AkNN, DNN, and deep forest, to enhance the classification ability of any single model. Let \(S_{AkNN}\), \(S_{DNN}\), and \(S_{DF}\) denote the association probabilities of an lncRNA–protein pair obtained by AkNN, DNN, and deep forest, respectively; its final relevance score is defined by Eq. (18) based on a soft voting technique:

$$\begin{aligned} S=\frac{1}{3}S_{AkNN}+\frac{1}{3}S_{DNN}+\frac{1}{3}S_{DF}. \end{aligned}$$
(18)

An lncRNA–protein pair is labeled as positive if its score from Eq. (18) is larger than 0.5; otherwise, the pair is classified as negative.
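Eq. (18) and the 0.5 threshold amount to the following few lines (a sketch, with array inputs assumed):

```python
import numpy as np

def soft_vote(s_aknn, s_dnn, s_df, threshold=0.5):
    """Average the three predictors' probabilities (Eq. 18) and threshold at 0.5."""
    s = (np.asarray(s_aknn) + np.asarray(s_dnn) + np.asarray(s_df)) / 3.0
    return s, np.where(s > threshold, 1, -1)
```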

3 Results

3.1 Evaluation Metrics

In the experiments, precision, recall, accuracy, F1 score, AUC, and AUPR are applied to assess the performance of EnANNDeep. For all six measurements, higher values indicate better prediction ability. Each experiment is repeated 20 times, and the averages over the 20 rounds are reported as the final performance.
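The six metrics can be computed with scikit-learn as sketched below; using average precision as the AUPR estimator is our choice:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute the six metrics of Sect. 3.1; y_true in {0, 1}, y_score a probability."""
    y_score = np.asarray(y_score)
    y_pred = (y_score > threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
        "AUPR": average_precision_score(y_true, y_score),
    }
```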

3.2 Experimental Settings

We conduct a grid search to find the parameters with which SFPEL, PMDKN, CatBoost, PLIPCOM, and EnANNDeep obtain their best performance. The details are listed in Table 4. The parameters of LPI-SKF are set to the default values provided by Zhou et al. [17].

Table 4 Parameter settings

In addition, to investigate the prediction performance of EnANNDeep for a new lncRNA or protein, three different fivefold CV settings are designed, as formalized in the splitting sketch after the following list.

  1. Fivefold CV on lncRNAs (\(CV_{l}\)): rows of Y are randomly hidden for testing; that is, 80% of lncRNAs are randomly chosen as the training set and the remaining 20% are used as the testing set in each round. \(CV_{l}\) evaluates finding interacting proteins for a new lncRNA without any associated proteins.

  2. Fivefold CV on proteins (\(CV_{p}\)): columns of Y are randomly hidden for testing; that is, 80% of proteins are randomly chosen as the training set and the remaining 20% are used as the testing set in each round. \(CV_{p}\) evaluates identifying interacting lncRNAs for a new protein without any associated lncRNAs.

  3. Fivefold CV on lncRNA–protein pairs (\(CV_{lp}\)): lncRNA–protein pairs in Y are randomly hidden for testing; that is, 80% of pairs are chosen as the training set and the remaining 20% are used as the testing set in each round. \(CV_{lp}\) evaluates uncovering interaction information based on known LPIs.
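A mask-based sketch of the three splitting schemes over the interaction matrix Y follows; the mask representation is our choice, as the paper does not prescribe an implementation:

```python
import numpy as np
from sklearn.model_selection import KFold

def fivefold_masks(Y, mode, seed=0):
    """Yield boolean test masks over the LPI matrix Y for the three CV settings:
    mode 'l' hides whole lncRNA rows (CV_l), 'p' hides whole protein columns (CV_p),
    and 'lp' hides individual lncRNA-protein pairs (CV_lp)."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    if mode == "l":
        for _, rows in kf.split(np.arange(Y.shape[0])):
            mask = np.zeros(Y.shape, dtype=bool)
            mask[rows, :] = True          # all pairs of the held-out lncRNAs
            yield mask
    elif mode == "p":
        for _, cols in kf.split(np.arange(Y.shape[1])):
            mask = np.zeros(Y.shape, dtype=bool)
            mask[:, cols] = True          # all pairs of the held-out proteins
            yield mask
    else:
        for _, idx in kf.split(np.arange(Y.size)):
            mask = np.zeros(Y.size, dtype=bool)
            mask[idx] = True              # individual held-out pairs
            yield mask.reshape(Y.shape)
```

The complement of each mask indexes the corresponding training pairs.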

3.3 Comparison with Five State-of-the-Art LPI Prediction Methods

We compare the proposed EnANNDeep method with five representative LPI prediction methods (SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF) to measure its prediction performance. SFPEL is an ensemble learning method for LPI prediction based on sequence feature projection. SFPEL first extracted sequence features for lncRNAs and proteins and then computed lncRNA similarity and protein similarity. Finally, it used a feature projection-based ensemble learning framework to predict LPIs, combining the computed similarity matrices.

PMKDN is a neighborhood nonnegative matrix decomposition model applied to possible LPI inference. PMKDN first selected multiple biological features of lncRNAs and proteins. Second, it combined protein GO ontology annotation and sequences, lncRNA sequences, and modified LPI network to calculate lncRNA similarity and protein similarity. Finally, it utilized a projection-based neighborhood nonnegative matrix decomposition algorithm to infer potential LPIs.

CatBoost is a recent gradient boosting algorithm. CatBoost implemented two key techniques: ordered boosting, a permutation-driven alternative to the classic boosting scheme, and a strategy for processing categorical features. Their combination enables CatBoost to outperform other available boosting techniques. CatBoost has been applied to LPI discovery and obtained good LPI classification ability.

PLIPCOM employed two network features, diffusion features and HeteSim features, and built an LPI prediction model integrating the Gradient Tree Boosting (GTB) algorithm.

LPI-SKF first computed lncRNA similarity based on expression profiles and sequences of lncRNAs and LPI network, and protein similarity based on statistical features and sequences of proteins and LPI network. It then constructed a universal similarity kernel matrix for new LPI identification based on a similarity kernel fusion technique.

We evaluate the performance of the proposed EnANNDeep framework under the three fivefold CVs. During CV, we randomly select unknown lncRNA–protein pairs as negative samples (non-LPIs). To reduce the overfitting problem produced by data imbalance, we set the ratio of negative LPIs to known LPIs to 1; that is, the number of screened negative LPIs equals the number of observed LPIs in the training and test sets. The best measurements are shown in bold in each row of Tables 5, 6 and 7.

Table 5 illustrates the prediction results of the six LPI identification models in terms of the six evaluation metrics under \(CV_{l}\). EnANNDeep achieves the highest average precision, recall, accuracy, F1 score, AUC, and AUPR. In particular, the average AUC computed by EnANNDeep exceeds those of SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF by 32.92%, 17.29%, 12.76%, 7.99%, and 3.94%, respectively. The average AUPR calculated by EnANNDeep is 33.33%, 15.85%, 12.78%, 9.76%, and 5.29% higher than those of the above five methods, respectively. The results suggest that EnANNDeep may be well suited to linkage discovery for a new lncRNA.

Table 5 The performance of six LPI prediction methods on \(CV_{l}\)

Table 6 describes the six evaluation values under \(CV_{p}\). From Table 6, it can be found that EnANNDeep achieves the best average precision, recall, accuracy, AUC, and AUPR under \(CV_{p}\). Although EnANNDeep obtains a relatively lower F1 score, it greatly boosts precision, recall, accuracy, AUC, and AUPR. For example, compared to SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF, its AUC is higher by 42.76%, 22.23%, 31.74%, 21.79%, and 25.74%, respectively, and its AUPR is higher by 36.15%, 14.68%, 31.87%, 23.25%, and 18.82%, respectively. AUC and AUPR characterize classifier performance more representatively than the other four measurements, and EnANNDeep distinctly outperforms the other five algorithms on both. Therefore, it is appropriate for prioritizing potential lncRNAs for a new protein.

Table 6 The performance of six LPI prediction methods on \(CV_{p}\)

The experimental results under \(CV_{lp}\) are listed in Table 7 and illustrate the strong LPI classification ability of EnANNDeep. Under \(CV_{lp}\), EnANNDeep obtains the best average recall, accuracy, F1 score, AUC, and AUPR. For example, it achieves an F1 score of 0.8569, which is 9.46%, 30.93%, 8.51%, 3.09%, and 18.09% better than SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF, respectively. Its average AUC is higher by 7.07%, 17.53%, 6.55%, 2.43%, and 1.13%, respectively, and its AUPR is higher by 4.67%, 14.98%, 4.19%, 3.74%, and 4.83%, respectively. SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF are state-of-the-art LPI prediction algorithms, and EnANNDeep greatly outperforms all five. The comparative results suggest the powerful performance of EnANNDeep under \(CV_{lp}\); that is, EnANNDeep can accurately mine underlying relationships between lncRNAs and proteins even in the absence of some LPIs.

Table 7 The performance of six LPI prediction methods on \(CV_{lp}\)

3.4 Comparison of Different Voting Methods

We conduct several experiments to observe the effect of the voting technique on classification performance. We consider two voting techniques: soft voting and hard voting. Given an unobserved lncRNA–protein pair, the hard voting method first obtains a label for the pair from each of AkNN, DNN, and deep forest. It then classifies the pair as positive if no fewer than two of the three basic predictors vote positive; otherwise, the pair is labeled as negative. The comparison results of the two voting approaches under the three CVs are shown in Tables 8, 9 and 10, from which we can see that the soft voting-based ensemble learning model obtains better performance than the hard voting method.
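For comparison, the hard voting rule described above reduces to a majority vote over thresholded scores (a sketch, with the same array inputs assumed as for soft voting):

```python
import numpy as np

def hard_vote(s_aknn, s_dnn, s_df, threshold=0.5):
    """Majority vote: each base predictor casts a binary label; a pair is positive
    when at least two of the three votes are positive."""
    votes = sum((np.asarray(s) > threshold).astype(int)
                for s in (s_aknn, s_dnn, s_df))
    return np.where(votes >= 2, 1, -1)
```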

Table 8 Comparison of two voting methods on \(CV_{l}\)
Table 9 The performance of two voting methods on \(CV_{p}\)
Table 10 The performance of two voting methods on \(CV_{lp}\)

3.5 The Effect of Numbers of RNA Secondary Structures on the Performance

Although abundant biological information contributes to improving LPI prediction performance, redundant biological features increase computational complexity. Therefore, we select representative features to describe lncRNA secondary structures. Tables 11, 12 and 13 list the performance of EnANNDeep based on the 5, 10, 64, and 128 highest-probability RNA secondary structures. The results indicate that the five highest-probability structure features can accurately depict RNA secondary structures. Therefore, we choose five lncRNA secondary structure features to reduce computation cost.

Table 11 The effect of the number of RNA secondary structures on performance under \(CV_{l}\)
Table 12 The effect of the number of RNA secondary structures on performance under \(CV_{p}\)
Table 13 The effect of the number of RNA secondary structures on performance under \(CV_{lp}\)

3.6 Case Study

In this section, we conduct several case studies to further evaluate the performance of EnANNDeep. We run the experiments ten times and report the average performance over the ten runs.

3.6.1 Finding Interacting Proteins for New lncRNAs

The lncRNA Small Nucleolar RNA Host Gene 1 (SNHG1) has close linkage with multiple human diseases. For example, SNHG1 is up-regulated in gastric cancer and may serve as a potential therapeutic target for gastric cancer [64]. It promotes cell proliferation and cell cycle progression and inhibits apoptosis in hepatocellular carcinoma [65]. It also enhances neuroinflammation in Parkinson’s disease [66]. In addition, non-small cell lung cancer has been reported to be associated with upregulated SNHG1 [67].

In the three human datasets, SNHG1 interacts with 6, 18, and 4 proteins, respectively. To find proteins interacting with SNHG1, all of its interaction information is hidden. The six LPI prediction algorithms are then applied to discover potential proteins for SNHG1. The predicted top 5 proteins are shown in Table 14. Q15717, O00425, Q9Y6M1, P35637, and Q9NZI8 are inferred to have the highest interaction probabilities with SNHG1 in dataset 1. Although the interactions between these five proteins and SNHG1 are unlabeled in dataset 1, O00425 and P35637 have been reported to have close relationships with SNHG1 in datasets 3 and 2, respectively, and Q15717, Q9Y6M1, and Q9NZI8 have been shown to link with SNHG1 in both datasets 2 and 3. In addition, all of the inferred top 5 proteins linking with SNHG1 rank highly in SFPEL, PMDKN, CatBoost, PLIPCOM, LPI-SKF, and EnANNDeep. The ranking results again demonstrate the LPI classification ability of EnANNDeep for a new lncRNA.

Table 14 The predicted top 5 proteins interacting with SNHG1

3.6.2 Finding Interacting lncRNAs for New Proteins

Q9UKV8 can inhibit translation initiation through interaction with the translation initiation factor EIF6 and by preventing the recruitment of the translation initiation factor EIF4-E. It up-regulates translation under serum starvation by binding to the AU element. More importantly, it is also involved in transcriptional gene silencing [41].

Q9UKV8 interacts with 207, 205, and 222 lncRNAs in the three human datasets, respectively. In this section, its association information with lncRNAs is hidden, and EnANNDeep is used to reveal its relevant lncRNAs. The top 5 human lncRNAs found to interact with Q9UKV8 are shown in Table 15.

On dataset 1, DANCR, RPI001_1039837, and AL139819.1 are inferred to interact with Q9UKV8. Although the associations between these three lncRNAs and Q9UKV8 are unknown in dataset 1, DANCR has been reported to interact with Q9UKV8 in dataset 2, while RPI001_1039837 and AL139819.1 have been validated to interact with Q9UKV8 in dataset 3.

On dataset 2, RMRP, SNORD17, and RPI001_483534 are predicted to interact with Q9UKV8. Although the relationships of RMRP and SNORD17 with Q9UKV8 are unknown in dataset 2, the two lncRNAs have been shown to link with Q9UKV8 in datasets 1 and 3, respectively. The interaction between RPI001_483534 and Q9UKV8 cannot be retrieved in any of the three datasets. However, it is ranked 5, 2, 478, 183, 9, and 86 by EnANNDeep, SFPEL, PMDKN, PLIPCOM, LPI-CatBoost, and LPI-SKF, respectively. The high rankings from EnANNDeep, SFPEL, and LPI-CatBoost suggest that RPI001_483534 may be related to Q9UKV8 and needs further validation.

On dataset 3, RPI001_84645 and EXOC3 are identified to interact with Q9UKV8. The associations between these two lncRNAs and Q9UKV8 can be found in dataset 2.

Table 15 The predicted top 5 lncRNAs interacting with Q9UKV8

3.6.3 Finding New LPIs Based on Known LPIs

Potential LPIs are subsequently identified by EnANNDeep based on labeled LPIs. The inferred top 50 LPIs with the highest interaction probabilities on the five datasets are illustrated in Figs. 4, 5, 6, 7 and 8. The 50 associations on each dataset contain both known and unknown LPIs.

Fig. 4

The predicted top 50 LPIs on dataset 1

Fig. 5

The predicted top 50 LPIs on dataset 2

Fig. 6

The predicted top 50 LPIs on dataset 3

Fig. 7

The predicted top 50 LPIs on dataset 4

Fig. 8

The predicted top 50 LPIs on dataset 5

The ranking results show that the pairs SNHG10–Q15717, VIM-AS1–Q15717, RPI001_101_2148–ENSP00000385269, AthlncRNA-159–22328551, and ZmalncRNA-1314–B6SP74 are the most probable LPIs among the unlabeled lncRNA–protein pairs on datasets 1–5, respectively. They are ranked 4, 14, 5, 33, and 1972 among 55,165, 74,340, 26,730, 3815, and 71,568 lncRNA–protein pairs, respectively.

The lncRNA SNHG10 is a novel driver of development and metastasis in hepatocellular carcinoma [68]. It has close linkages with cell proliferation in gastric cancer [69], non-small cell lung cancer [70], and osteosarcoma [71]. Q15717 is an RNA-binding protein [72] that contributes to embryonic stem cell differentiation, can increase the stability of leptin mRNA, and mediates the anti-proliferative activity of CDKN2A [41]. Since both SNHG10 and Q15717 are densely linked with cell proliferation activity, we infer that SNHG10 may interact with Q15717; the pair is worthy of further validation.

4 Discussion

Identification of LPI candidates contributes to discovering the functions and mechanisms of lncRNAs. In this manuscript, an ensemble framework combining AkNN, DNN, and deep forest is developed to find possible interactions between lncRNAs and proteins. Three different CVs are conducted to compare the proposed EnANNDeep model with other LPI prediction methods. The experimental results indicate that EnANNDeep can be applied more accurately to new LPI discovery.

Under \(CV_{p}\), most of the performance values achieved by SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF are much lower than those under \(CV_{l}\) and \(CV_{lp}\). Under \(CV_{p}\), 80% of proteins are used to train the model and the remaining proteins are used to test it. The five LPI datasets contain only 59, 84, 27, 35, and 42 proteins, respectively. When the associations of 20% of the proteins are masked, many LPIs are shielded, which reduces the degree to which a classification model can fit the LPI data. Therefore, the abundance of data severely affects the learning capacity of the five models. In comparison, the performance obtained by EnANNDeep under \(CV_{p}\) remains relatively steady or even exceeds that under \(CV_{l}\) and \(CV_{lp}\). The results demonstrate the robustness of the proposed EnANNDeep algorithm across CV settings.

More importantly, like EnANNDeep, SFPEL, CatBoost, and PLIPCOM are ensemble learning-based algorithms, and all four integrate sequence information of lncRNAs and proteins. SFPEL is a feature projection-based technique, while PLIPCOM and CatBoost are gradient tree boosting and categorical boosting algorithms, respectively. EnANNDeep outperforms the three other ensemble learning models, demonstrating the superior classification ability of its basic predictors; that is, AkNN, DNN, and deep forest can be integrated more effectively to find possible LPIs. In addition, the case analyses further suggest that EnANNDeep can mine useful information for a new lncRNA or protein.

The EnANNDeep framework demonstrates powerful LPI discovery ability, especially under \(CV_{lp}\). This may be attributed to the following characteristics. First, the deep models, DNN and deep forest, exhibit strong feature representation ability; in particular, deep forest works well even on small-scale data. Second, the proposed AkNN classifier separately picks the most appropriate k for each query point, so the algorithm can better set its confidence intervals. Third, the ensemble framework effectively integrates the prediction results from the three predictors and thus improves the classification performance of EnANNDeep. Finally, it integrates multiple sources of biological information related to LPIs.

Although EnANNDeep can precisely identify new LPIs, it has one limitation: we select negative LPIs from unlabeled lncRNA–protein pairs. Indeed, unknown lncRNA–protein pairs may contain positive LPIs, which can affect the prediction ability of a model.

5 Conclusions

lncRNAs play pivotal roles in regulating many hallmarks of cancer biology. To decipher lncRNA functions, we focus on new LPI mining. First, five LPI-related datasets are assembled. Second, lncRNA and protein features are fused to depict each lncRNA–protein pair. Third, an ensemble model, composed of AkNN, DNN, and deep forest, is developed to classify unlabeled lncRNA–protein pairs. Finally, the interaction probabilities of each lncRNA–protein pair from the three predictors are integrated with a soft voting technique to obtain the final classification. The results of the comparative experiments and case analyses demonstrate that EnANNDeep can accurately uncover the interplays between lncRNAs and proteins. The case analyses suggest that there probably exists an interaction between SNHG10 and Q15717.

In future research, we will first integrate various lncRNA-related datasets from different data sources to investigate the biomolecules interacting with lncRNAs, for example, lncRNA–miRNA interactions [73] and lncRNA–DNA interactions [36]. Second, more biological information on lncRNAs and proteins, for example, secondary structures of lncRNAs and secondary and tertiary structures of proteins, will be fused to represent an lncRNA–protein pair. Finally, we will develop a negative sample selection method based on positive-unlabeled learning to screen reliable negative LPIs.