Keywords

1 Introduction

Identification of protein–protein interactions (PPIs) is crucial for studying protein function and deep understanding of biological processes in a cell. In recent years, plenty of high-throughput technologies, such as Yeast two-hybrid (Y2H) screens [1, 2], tandem affinity purification (TAP) [3] and mass spectrometric protein complex identification (MS-PCI) [4], have been developed for the large-scale PPIs detection. However, these previous methods are expensive and require a great deal of human effort. In addition, only a small part of the whole PPIs have been identified. Thus, we can draw a conclusion that only using experimental methods is difficult to identify unknown PPIs.

In recent years, researchers have developed different types of computational methods for the prediction of PPIs [5,6,7,8,9,10,11,12,13,14,15,16]. For example, Li et al. proposed a novel computational model combining Position Weight Matrix (PWM) and Scale-Invariant Feature Transform (SIFT) algorithm [17]. An et al. proposed an effective algorithm that using Gray Wolf Optimizer–Based Relevance Vector Machine to identify PPIs [18]. Zhu et al. proposed a useful tool which used the Position-Specific Scoring Matrices (PSSMs) and ensemble learning algorithm Rotation Forest (RF) [19]. You et al. developed a novel method for detecting PPIs by integrating a new protein sequence substitution matrix feature representation and ensemble weighted sparse representation model classifier [20]. Huang et al. proposed a sequence-based method based on the combination of weighted sparse representation based classifier (WSRC) and global encoding (GE) of amino acid sequence [21]. Wen et al. proposed an effective method based on similar network fusion (SNF) model to integrate the physical and chemical properties of proteins [22]. Leon Wong et al. presented a novel computational approach that combining a Rotation Forest Model with a novel PR-LPQ Descriptor [23]. Huang et al. developed an effective algorithm, which is based on Extreme Learning Machine (ELM) and combined the concept of Chou’s Pseudo-Amino Acid Composition (PseAAC) composition [24].

In this paper, we propose a novel computational method for predicting PPIs, which combines weighted sparse representation based classifier (WSRC) and the dual-tree complex wavelet transform (DTCWT). Specifically, we first select substitute matrix representation (SMR) based on BLOSUM62 to represent protein sequences. Then, we adopt the DTCWT to extract feature vectors from each SMR matrix. Finally, we utilize WSRC to predict PPIs on two different biological datasets: Yeast and Human. Our model achieves excellent performance results which obtain average accuracies of 97.12% and 97.56%, respectively. In order to further evaluating the proposed method, we compared the WSRC with the state-of-the-art support vector machine (SVM) classifier. The promising results demonstrated that the proposed method is robust and stable for the prediction of PPIs.

2 Materials and Methodology

2.1 Godden Standard Datasets

The first dataset that we chose in this paper is gathered from publicly available database of interacting proteins (DIP). We removed the protein pairs whose length are less than 50 residues because these might be fragments. The pairs with ≥40% sequence identity have been deleted too. In this way, the positive dataset is constructed by the remaining 5594 protein pairs. Moreover, we selected 5594 additional protein pairs of different subcellular localizations to build the negative dataset. Consequently, the whole dataset is made up of 11188 protein pairs.

In order to demonstrate the generality of the proposed method, we validated our method on another PPI dataset. We collected the dataset from the Human Protein References Database (HPRD). Those protein pairs which have ≥25% sequence identity have been removed. Finally, we used the remaining 3899 protein-protein pairs of experimentally verified PPIs from 2502 different human proteins, so that we can comprise the golden standard positive dataset. Following the previous work [25], we assume that the proteins in different subcellular compartments will not interact with each other and finally obtained 4262 protein pairs from 661 different human proteins as the negative dataset. As a result, the Human dataset is constructed by 8161 protein pairs.

2.2 Substitution Matrix Representation

Substitution matrix representation (SMR) is a modified version of representation method reported by [26]. In this novel matrix representation for proteins, we generated a \( N\; \times \;20 \) matrix to represent the \( N \)-length protein sequence, which are based on a substitution matrix. BLOSUM62 matrix is a powerful substitution matrix and has been utilized in this work for the sequence alignment of proteins. SMR can be defined as follows:

$$ SMR(i.j) = B(P(i),j) \, i = 1 \cdots N,j = 1 \cdots 20 $$
(1)

In this formula, B means the BLOSUM62 matrix, it is a 20 × 20 substitution matrix and \( B(i,j) \) represents the value in row \( i \) and column \( j \) of BLOSUM62 matrix, this value represents the probability rate of amino acid \( i \) converting to amino acid \( j \) in the evolution process; \( P = (p1,p2 \cdots pN) \) is the given protein sequence constructed by \( N \) amino acids.

2.3 The Dual-Tree Complex Wavelet Transform

The dual-tree complex wavelet transform (DTCWT) [27] is a variant of the traditional complex wavelet transform (DWT). It inherited the characteristics of the Multi-scale and Multi-resolution of discrete wavelet transform. At the same time, it makes up for the deficiencies of complex wavelet transform with large amount of calculation and high complexity. Different with the traditional DWT, DTCWT utilized two real DWTs to form a complex transform [28]. The first part symbolizes the real component and the second part represents the imaginary component of this transform.

The DTCWT settled the matters about the “shift-invariant problems” and “directional selectivity in two or more dimensions,” which are both weak points of conventional DWT [29]. It acquired directional selectivity by using approximate analytic wavelets. It also has the skills to generate a total of six directionally discriminating sub-bands oriented in the \( \pm 15^\circ \), \( \pm 45^\circ \) and \( \pm 75^\circ \) directions, for both the real \( (R) \) and imaginary \( (I) \) parts. Let \( h_{i} (n) \) and \( g_{i} (n) \) be the filters in the first stage. Let the new stage response of the first filter bank be \( H_{new}^{(k)} (e^{jw} ) \) and second filter bank be \( H_{new}^{'(k)} (e^{jw} ) \); we now have the following result.

Suppose one is provided with CQF pairs \( \left\{ {h_{0} (n),h_{1} (n)} \right\},\left\{ {h_{0}^{'} (n),h_{1}^{'} (n)} \right\} \). For \( k\text{ > }1 \).

$$ H_{new}^{(k)} (e^{jw} ) = H\left\{ {H_{new}^{'(k)} (e^{jw} )} \right\} $$
(2)

if and only if

$$ h_{0}^{'(1)} (n) = h_{0}^{(1)} (n - 1) $$
(3)

A 2D image \( f(x,y) \) can be decomposed by 2D DTCWT over a series of dilations and translations of a complicated scaling function and six complex wavelet function \( \varphi_{j,l}^{\theta } \); that is:

$$ f(x,y) = \sum\limits_{{l \in Z^{2} }} {s_{{j_{0} ,l^{\phi } j_{0} }} l^{(x,y)} + \sum\limits_{\theta \in \Uptheta } {\sum\limits_{{j \ge j_{0} }} {\sum\limits_{{l \in Z^{2} }} {c_{j}^{\theta } ,l^{{\varphi_{j}^{\theta } }} ,l^{(x,y)} } } } } $$
(4)

where \( \theta \in \Uptheta = \left\{ { \pm 15^\circ ,\; \pm 4 5^\circ ,\; \pm 7 5^\circ } \right\} \) gives the directionality of the complex wavelet function.

2.4 Weighted Sparse Representation-Based Classification

In the past twenty years, sparse representation based classifier (SRC) [30, 31] has earned considerable attention in the field of signal processing, pattern recognition and computer vision because of the great development of linear representation methods (LRBM) and compressed sensing (CS) theory. Sparse representation attempts to optimize matrix to reveal the relationship between any given test sample and the training set. Therefore, it would be a good trial to use it for building a prediction system for PPIs. In this work, we build a computational model by employing weighted sparse representation-based classifier (WSRC).

Given a training sample matrix \( x \in R^{m\; \times \;n} \) which is made up of \( n \) samples of \( m \) dimensions. If there are sufficient training samples belonging to the \( kth \) class, then the sub-matrix constructed by the samples of the \( kth \) class can be symbolized as \( X_{k} \left[ {l_{k1} ,l_{k2} \cdots l_{{kn_{k} }} } \right] \), where \( l_{i} \) denotes the class of \( ith \) sample and \( n_{k} \) is the number of samples belonging to \( kth \) class. Thus, X can be further rewritten as \( X = \left[ {X_{1} ,X_{2} \cdots X_{K} } \right] \), where \( K \) is the class number of the whole samples. Given a test sample \( y \in R^{m} \) and it can be represented as

$$ y = \alpha_{k,1} l_{k,1} + \alpha_{k,2} l_{k,2} + \cdots + \alpha_{{k,n_{k} }} l_{{k,n_{k} }} $$
(5)

when considering the whole training set representation, Eq. (5) can be further symbolized as

$$ y = X\alpha_{0} $$
(6)

where \( \alpha_{0} = \left[ {0, \ldots 0,\alpha_{k,2} , \cdots ,\alpha_{{k,n_{k} }} ,0, \cdots 0} \right]^{T} \). For these reason that the nonzero entries in \( \alpha_{0} \) are only associated with the \( kth \) class, so if class number of samples become large, the \( \alpha_{0} \) would come to be sparse. The key question of SRC algorithm is searching the \( \alpha \) vector which can subject to Eq. (6) and minimize the \( \ell_{0} \)-norm of itself:

$$ \begin{aligned} \widehat{{\alpha_{ 0} }} = \arg \hbox{min} \left\| \alpha \right\|_{0} \; \hfill \\ {\text{subject to}}\;y = X\alpha \hfill \\ \end{aligned} $$
(7)

Problem (7) is an NP-hard problem and it can be achieved but difficultly to be solved precisely. According to the theory of compressive sensing [32, 33] show, if \( \alpha \) is sparse enough, we can solve the related problem convex \( l_{1} \)-minimization problem instead of solving the \( l_{0} \)-minimization problem directly.

$$ \begin{aligned} \widehat{{\alpha_{1} }} = \arg \hbox{min} \left\| \alpha \right\|_{1} \;\; \hfill \\ {\text{subject to}}\;y = X\alpha \hfill \\ \end{aligned} $$
(8)

When dealing with occlusion, we should extend Eq. (8) to the stable \( \ell_{1} \)-minimization problem

$$ \begin{aligned} \widehat{{\alpha_{1} }} = \arg \hbox{min} \left\| \alpha \right\|_{1} \;\;\;\;\;\;\;\; \hfill \\ {\text{subject to}}\;\left\| {y - X\alpha } \right\| \le \varepsilon \hfill \\ \end{aligned} $$
(9)

where \( \varepsilon > 0 \) represent the tolerance of reconstruction error. Given the solution from Eq. (9), the SRC algorithm assigns the label of test sample \( y \) to class \( c \) with the reconstruction residual:

$$ \mathop {\hbox{min} }\limits_{c} r_{c} (y) = \left\| {y - X\widehat{{\alpha_{1} }}^{c} } \right\|, \, c = 1 \cdots K $$
(10)

Besides sparse representation, Nearest Neighbor (NN) is another popular classifier which only considering the influence of the Nearest Neighbor in training data to classify the test sample and SRC uses the linearity structure of data and overcomes the drawback of NN. Some researches shows that locality is more essential than sparsity in some cases [34, 35]. Lu et al. [36] have proposed a modified version of traditional sparse representation based classifier called weighted sparse representation based classifier (WSRC), it integrates the locality structure of data into basic sparse representation. Specifically, Gaussian distance between single sample and the whole training samples will be first computed and WSRC can use it as the weights of each training sample. The Gaussian distance between two samples, \( s_{1} \) and \( s_{2} \) can be described as follow:

$$ d_{G} (s_{1} ,s_{2} ) = e^{{ - \left\| {s_{1} - s_{2} } \right\|^{2} /2\alpha^{2} }} $$
(11)

where \( \sigma \) is the Gaussian kernel width. In this way, weighted sparse representation based classifier can retain the locality structure of data and then it turned to solve the following problem:

$$ \begin{aligned} \widehat{{\alpha_{ 1} }}{ = }\arg \hbox{min} \left\| {W\alpha } \right\|_{1} \; \hfill \\ {\text{subject to}}\;y = X\alpha \hfill \\ \end{aligned} $$
(12)

specifically,

$$ diag(W) = \left[ {d_{G} (y,x_{1}^{1} ), \ldots ,d_{G} (y,x_{{n_{k} }}^{k} )} \right]^{T} $$
(13)

where W is a block-diagonal matrix of locality adaptor and \( n_{k} \) is the sample number of training set in class \( k \). Dealing with this occlusion, we can eventually solve the following stable \( \ell_{1} \)-minimization problem:

$$ \begin{array}{*{20}l} {\widehat{{\alpha_{ 1} }}{ = }\arg \hbox{min} \left\| {W\alpha } \right\|_{1} } \hfill \\ {{\text{subject}}\;{\text{to}}\;\left\| {y - X\alpha } \right\| \le \varepsilon } \hfill \\ \end{array} $$
(14)

where \( \varepsilon \text{ > }0 \) is the tolerance value.

In summary, the WSRC algorithm can be summarized by the following steps:

figure a

3 Results and Discussion

In order to evaluate the performance of the proposed method, the overall prediction accuracy (Acc.), sensitivity (Sen.), precision (PR.), and Matthews’s correlation coefficient (MCC.) were calculated. They are defined as follows:

$$ A{\text{cc}}. = \frac{TP + TN}{TP + FP + TN + FN} $$
(15)
$$ Sen. = \frac{TP}{TP + FN} $$
(16)
$$ PR. = \frac{TP}{TP + FP} $$
(17)
$$ MCC. = \frac{TP \times TN - FP \times FN}{{\sqrt {(TP + TN) \times (TN + FN) \times (TP + FP) \times (TN + FN)} }} $$
(18)

In these algorithm, true positive (TP) represents the number of true samples which are predicted correctly; false negative (FN) is the number of samples predicted to be non-interacting pairs incorrectly; false positive (FP) is the number of true non-interacting pairs predicted to be PPIs falsely; and true negative (TN) is the number of true noninteracting pairs predicted correctly. What’s more, the receiver operating characteristic (ROC) curves are also calculated to evaluate the performance of proposed method. In order to summarize ROC curve in a numerical way, the area under an ROC curve (AUC) was computed.

3.1 Assessment of Prediction Ability

For the sake of fairness, when predicting PPIs of Yeast and Human, we set the same corresponding parameters of weighted sparse representation-based classifier. We set the \( \sigma = 1.5 \) and \( \varepsilon = 0.00005 \). In addition, 5-fold cross-validation [37] was employed in our experiments in order to avoid overfitting and get a stable and reliable model from the proposed method. Specifically, we divided the whole dataset into five subsets. Four of the subsets are used for training and the last part was used for testing. By this way, the results of these experiments in which we used the proposed model to predict PPIs of Yeast and Human datasets are shown in Tables 1 and 2.

Table 1. 5-fold cross-validation results obtained by using proposed method on Yeast dataset.
Table 2. 5-fold cross-validation results obtained by using proposed method on Human dataset.

When using the proposed method to predict PPIs of the Yeast dataset, we obtained the prediction results with average accuracy, precision, sensitivity, and MCC of 97.12%, 100%, 94.24% and 94.40%. The standard deviations of these criteria values are relatively low, which of accuracy, precision, sensitivity and MCC are 0.35%, 0.00%, 0.69%, and 0.67%, respectively. Form Table 2, when exploring the Human dataset, the proposed method yielded results of average accuracy, precision, sensitivity and MCC of 97.56%, 99.49%, 95.40%, and 95.23%. The standard deviations are 0.63%, 0.32%, 1.21%, and 1.19%, respectively. The ROC curves performed on these two datasets are shown in Fig.  1. To better evaluate the prediction performance of the proposed method, we computed the AUC values of Yeast and Human datasets, which are 97.14% and 97.31%, respectively.

Fig. 1.
figure 1

ROC from proposed method result for Yeast and Human PPIs dataset.

The high accuracies show that WSRC based model combining the SMR-DTCWT descriptors is feasible and effective for predicting PPIs. All experimental results demonstrate the feasibility, effectiveness and robustness of the proposed method.

3.2 Comparison with SVM-Based Method

In order to further evaluate the performance of the proposed method, we compared the prediction performance of the proposed method with the state-of-the-art SVM classifier on the dataset of Human and Yeast. We utilized the same feature extraction method and a grid search method to optimize two corresponding parameters of SVM \( c \) and \( g \). Here, we set the \( c = 0.3 \) and \( g = 0.3 \).

Table 3 shows that when using SVM to predict PPIs of Yeast dataset, we obtained relatively poor results with the average accuracy, precision, sensitivity, MCC, and AUC of 84.74%, 83.44%, 86.71%, 74.11%, and 91.99%, respectively. When exploring the Human dataset with the SVM-based method yielded relatively low results with the average accuracy, precision, sensitivity, MCC, and AUC of 83.14%, 81.92%, 83.08%, 71.92%, and 90.57%, respectively. Considering the comparison result and higher values for criteria and lower standard deviations, the prediction performance of SVM-based method is lower than that of WSRC. The ROC curves performed by SVM classifier on the two datasets are shown in Fig. 2.

Table 3. Comparison with support vector machine on Yeast and Human datasets
Fig. 2.
figure 2

ROC from SVM-based method result for Yeast and Human PPIs dataset.

4 Conclusions and Discussion

It is becoming more and more important to develop an effective and accurate method for predicting PPIs. In this work, we explore a novel computation model for predicting PPIs by combing weighted sparse representation-based classifier and the dual-tree complex wavelet transform. In the step of feature extraction, it has been proven that it is effective to combine the SMR matrix and dual-tree complex wavelet transform. Compared with the previous methods, the main improvement of the proposed method is to adopt a novel protein feature representation and utilizing a powerful classifier. In addition, good experiment results indicate that the proposed method performs well in PPIs prediction and has great generalization ability.

Competing Interests. The authors declare that they have no competing interests.