1 Introduction

Proteins are among the most crucial macromolecules of life and participate in almost all processes within the cell, including signaling cascades, DNA transcription and replication, and metabolic cycles (Yin et al. 2014). It has been confirmed that proteins rarely perform their functions independently. Instead, they cooperate with other macromolecules, and in particular with other proteins, by forming a huge network of protein–protein interactions (PPIs) (Gavin et al. 2002). Therefore, the study of protein–protein interactions has become a central issue in systems biology (Theofilatos et al. 2011; Tuncbag et al. 2009; Zhu et al. 2011; You 2010).

In recent years, a variety of experimental techniques have been developed for the detection of protein interactions, such as yeast two-hybrid screens (Ito et al. 2001; Ho et al. 2002; Krogan et al. 2006), protein chips (Zhu et al. 2001) and mass spectrometric protein complex identification (Ho et al. 2002). These experimental methods have revealed many previously unknown interactions. However, owing to their high false-positive rates, long running times, high cost and low coverage, the results obtained by experimental methods still fall far short of researchers' expectations. Therefore, in the post-genomic era there is an urgent need for reliable computational methods that predict protein–protein interactions as a complement to experimental methods (Jin 2000; Guo et al. 2008; You et al. 2010, 2016; Shen et al. 2007; Zhu 2015).

At the same time, a number of computational methods have been proposed for the prediction of protein–protein interactions (Ji et al. 2014; Zhu et al. 2013a, b; Lin et al. 2013; Jin et al. 2002). These methods include the phylogenetic profile method (Pazos and Valencia 2001), gene neighborhood information (Ideker et al. 2002), the coevolution of interacting proteins (Pazos et al. 1997), phylogenetic relationships (Pazos et al. 1997), literature mining (Marcotte et al. 2001), gene fusion events (Enright et al. 1999), gene co-expression (Ideker et al. 2002) and so on (Zhu et al. 2015). However, the applicability of these methods is limited (Yin et al. 2008; Mao et al. 2007) because they depend on prior information about the protein pairs. Therefore, methods that obtain information directly from the amino acid sequence have attracted growing interest (Guo et al. 2008; Shen et al. 2007; Zhu et al. 2015; Bock and Gough 2003; Martin et al. 2005; Zhang et al. 2012; Nanni 2005; Nanni and Lumini 2006; Gao et al. 2016; Jin and Sendhoff 2008). At present, many researchers are engaged in developing sequence-based methods to predict protein interactions. For example, You et al. (2013) developed the principal component analysis–ensemble extreme learning machine (PCA–EELM) method, which uses only protein sequence information to predict PPIs. This method yields 87.00% prediction accuracy, 86.15% sensitivity and 87.59% precision on the PPI data of Saccharomyces cerevisiae. Shen et al. (2007) considered the local environments of residues and developed an SVM model that combines a conjoint triad feature with an S-kernel function of protein pairs to predict PPI networks. This model yielded a high prediction accuracy of 83.93% on a human PPI data set. Martin et al. (2005) proposed a model based on a product of subsequences, an expansion of the signature descriptor from chemical information, to detect PPIs. The accuracy obtained by this model was 80% under tenfold cross-validation on the Yeast data sets.

In this paper, we propose a sequence-based method to predict protein–protein interactions, which is based on a feature weighted rotation forest algorithm (FWRF) (Rodriguez and Kuncheva 2006; Nanni and Lumini 2009). In particular, we convert each protein amino acid sequence into a Position-Specific Scoring Matrix (PSSM) (Gribskov et al. 1987) and then feed the features extracted by local phase quantization (LPQ) (Ojansivu and Heikkila 2008) into the feature weighted rotation forest classifier to predict PPIs. In a data set, the importance of features is not uniform: some features carry more information, some carry less, and some even carry interfering information. If the features containing interfering information can be eliminated, the classifier can separate the samples more easily. Based on this idea, we introduce the concept of feature weights and improve the original rotation forest algorithm. After each feature is assigned a weight, the new algorithm can remove low-value features, namely noise, so the classifier obtains more accurate information from the feature subset. In the experiments, we tested the improved algorithm on different data sets and compared it with other excellent methods. The experimental results show that our feature weighted rotation forest algorithm does improve both classification accuracy and computational efficiency.

2 Materials and methodology

2.1 Data sources

We implement our method on the H. pylori data set introduced by Rain et al. (2001). The H. pylori data set consists of 2916 protein pairs, of which 1458 pairs are interacting and 1458 pairs are non-interacting. This data set provides a platform for comparing our method with others and can be downloaded at http://www.cs.sandia.gov/~smartin/software.html. The Yeast data were extracted from the S. cerevisiae core subset of the database of interacting proteins (DIP) (Xenarios et al. 2002), version DIP_20070219. After removing protein pairs whose proteins have fewer than 50 residues or share more than 40% sequence identity, the final Yeast data set contains 5594 positive samples and 5594 negative samples.

2.2 Position-Specific Scoring Matrix (PSSM)

A Position-Specific Scoring Matrix (PSSM) is produced from a set of sequences previously aligned by sequence or structural similarity. It was introduced by Gribskov et al. (1987) to detect distantly related proteins. A PSSM is an \(L\times 20\) matrix, where the L rows correspond to the residues of the protein and the 20 columns correspond to the 20 native amino acids. Let \(\hbox {PSSM}=\left\{ {P_{i,j} :i=1\ldots L{\text { and }}j=1\ldots 20} \right\} \); each matrix is represented as follows:

$$\begin{aligned} \hbox {PSSM}=\left[ {\begin{array}{cccc} P_{1,1} & P_{1,2} & \cdots & P_{1,20} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ P_{L,1} & P_{L,2} & \cdots & P_{L,20} \\ \end{array}} \right] \end{aligned}$$
(1)

where \(P_{i,j}\) denotes the score of the ith residue being mutated into the jth of the 20 native amino acid types during evolution, as derived from multiple sequence alignments.

In our study, we used the Position-Specific Iterated BLAST (PSI-BLAST) tool (Altschul et al. 1997) with the SwissProt database on a local machine to create a PSSM for each protein sequence. PSI-BLAST is a highly sensitive protein sequence alignment program, particularly effective at discovering new members of a protein family and similar proteins in distantly related species. In order to obtain broad and highly homologous sequences, we set the e-value to 0.001 and the number of iterations to 3, leaving all other parameters at their defaults. PSI-BLAST and the SwissProt database are available at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
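As an illustration, the following minimal sketch shows how such a PSSM could be generated with the BLAST+ psiblast command line tool from a Python script. The file names protein.fasta and protein.pssm and the local database name swissprot are hypothetical placeholders, not files distributed with this work.

```python
import subprocess

# Hedged sketch: produce an ASCII PSSM for one sequence with BLAST+ psiblast.
# Assumes BLAST+ is installed and a local SwissProt database has been built
# with makeblastdb; all file names below are illustrative placeholders.
subprocess.run(
    [
        "psiblast",
        "-query", "protein.fasta",          # one protein sequence in FASTA
        "-db", "swissprot",                 # local SwissProt database
        "-num_iterations", "3",             # 3 iterations, as in the text
        "-evalue", "0.001",                 # e-value threshold of 0.001
        "-out_ascii_pssm", "protein.pssm",  # the L x 20 PSSM used downstream
    ],
    check=True,
)
```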

2.3 Local phase quantization (LPQ)

Local phase quantization (LPQ) was originally proposed by Ojansivu and Heikkila (2008) for texture description. The LPQ method is based on the blur invariance property of the Fourier phase spectrum (Wang et al. 2015; Li et al. 2011; Li and Olson 2010). The discrete model for spatially invariant blurring of an original image f(x), apparent in an observed image g(x), can be expressed as a convolution:

$$\begin{aligned} g(x)=f(x)\ast h(x) \end{aligned}$$
(2)

where x is a vector of coordinates \(\left[ {x,y} \right] ^\mathrm{T}\), \(\ast \) denotes two-dimensional convolution and h(x) represents the point spread function (PSF) of the blur. In the Fourier domain, this is equivalent to

$$\begin{aligned} G(u)=F(u)\cdot H(u) \end{aligned}$$
(3)

where u represents a vector of coordinates \(\left[ {u,v} \right] ^\mathrm{T}\), and G(u), F(u) and H(u) are the discrete Fourier transforms (DFT) of the blurred image, the original image and the point spread function (PSF), respectively. Considering only the phase of the spectrum, the phase relation can be expressed as

$$\begin{aligned} \angle G(u)=\angle F(u)+\angle H(u) \end{aligned}$$
(4)

If the point spread function h(x) is centrally symmetric, its Fourier transform H(u) is always real-valued, so its phase \(\angle H(u)\in \left\{ {0,\pi } \right\} \) can only be a two-valued function:

$$\begin{aligned} \angle H(u)=\left\{ {{\begin{array}{l@{\quad }l} 0 &{}\hbox {if}\, H(u)\ge 0 \\ \pi &{} \hbox {if}\, H(u)<0 \\ \end{array}}} \right. \end{aligned}$$
(5)

This means that

$$\begin{aligned} \angle H(u)=\angle F(u) \quad \hbox {if} \ H(u)\ge 0 \end{aligned}$$
(6)

In local phase quantization, a short-term Fourier transform (STFT), computed over a rectangular M-by-M neighborhood \(N_x \) at each pixel position x of an image f(x), is defined by:

$$\begin{aligned} F\left( {u,x} \right) =\sum \limits _{y\in N_x} f\left( {x-y} \right) \hbox {e}^{-j2\pi u^\mathrm{T}y}=\omega _u^\mathrm{T} f_x \end{aligned}$$
(7)

where \(f_x \) is a vector containing all \(M^{2}\) image samples from \(N_x \) and \(\omega _u \) is the basis vector of the two-dimensional DFT at frequency u.

The local Fourier coefficients are computed at four frequency points: \(u_1 =\left[ {a,0} \right] ^\mathrm{T}\), \(u_2 =\left[ {0,a} \right] ^\mathrm{T}\), \(u_3 =\left[ {a,a} \right] ^\mathrm{T}\), and \(u_4 =\left[ {a,-a} \right] ^\mathrm{T}\), where a is a scalar small enough to satisfy \(H\left( {u_i} \right) >0\). Each pixel can thus be expressed as a vector:

$$\begin{aligned} F_x^c= & {} \left[ {F\left( {u_1,x} \right) , F\left( {u_2,x} \right) , F\left( {u_3,x} \right) , F\left( {u_4,x} \right) } \right] , \nonumber \\ F_x= & {} \left[ {\hbox {Re}\left\{ {F_x^c} \right\} , \hbox {Im}\left\{ {F_x^c} \right\} } \right] ^\mathrm{T}. \end{aligned}$$
(8)

The resulting vectors are quantized with a simple scalar quantizer:

$$\begin{aligned} q_j (x)=\left\{ {{\begin{array}{l@{\quad }l} 1, &{} \hbox {if}\, g_j (x)\ge 0 \\ 0, &{} \hbox {otherwise} \\ \end{array}}} \right. \end{aligned}$$
(9)

where \(g_j (x)\) is the jth component of \(F_x \). After quantization, \(F_x \) becomes a vector of eight binary values, and the jth component is assigned a weight of \(2^{j}\). The eight binary coefficients are then represented as integer values between 0 and 255 using binary coding:

$$\begin{aligned} f_\mathrm{LPQ} (x)=\sum \limits _{j=0}^7 q_j (x)2^{j} \end{aligned}$$
(10)

From all these values, we obtain a histogram that can be represented as a 256-dimensional feature vector. In this article, each PSSM from the H. pylori data set is converted into a 256-dimensional feature vector using the LPQ method.
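To make the descriptor concrete, the following is a minimal Python sketch of the LPQ histogram computation described above, applied to a 2-D array such as a PSSM. The window size win and the choice \(a = 1/\hbox {win}\) are assumed, common defaults, and the decorrelation step used in some LPQ variants is omitted; this is an illustrative sketch, not the exact implementation used in this paper.

```python
import numpy as np
from scipy.signal import convolve2d

def lpq_histogram(img, win=3):
    """Hedged LPQ sketch: STFT at the four frequencies u1..u4, sign
    quantization of the 8 real/imaginary parts (Eq. 9), 8-bit coding
    (Eq. 10), then a 256-bin histogram."""
    img = np.asarray(img, dtype=float)
    r = (win - 1) // 2
    a = 1.0 / win                               # assumed frequency parameter
    x = np.arange(-r, r + 1)

    w0 = np.ones_like(x, dtype=complex)         # frequency 0
    wp = np.exp(-2j * np.pi * a * x)            # frequency +a
    wm = np.conj(wp)                            # frequency -a

    def stft(wy, wx):
        # separable 2-D STFT kernel: filter rows, then columns
        return convolve2d(convolve2d(img, wy[:, None], mode="valid"),
                          wx[None, :], mode="valid")

    # F(u, x) at u1=[a,0], u2=[0,a], u3=[a,a], u4=[a,-a]
    F = [stft(wp, w0), stft(w0, wp), stft(wp, wp), stft(wp, wm)]

    comps = [f.real for f in F] + [f.imag for f in F]   # g_0 .. g_7
    code = np.zeros(F[0].shape, dtype=int)
    for j, g in enumerate(comps):
        code += (g >= 0).astype(int) << j               # Eqs. (9)-(10)

    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / hist.sum()    # normalized 256-dimensional feature vector
```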

3 Feature weighted rotation forest algorithm

3.1 Original rotation forest classifier

Rotation forest (RF) is a popular ensemble classifier. In order to generate the training samples of each base classifier, the feature set is randomly divided into K subsets. A linear transformation is applied to each subset, and all principal components are retained to preserve the information in the data. The rotation forms new features for the training samples, which ensures the diversity of the data. Hence rotation forest enhances both the accuracy of the individual classifiers and the diversity within the ensemble. The framework of RF is described as follows.

Let X be the training sample set, Y the corresponding labels and F the feature set. Assume \(\left\{ {x_i,y_i} \right\} \) contains D training samples, where \(x_i =\left( {x_{i1},x_{i2} ,\ldots ,x_{in}} \right) \) is an n-dimensional feature vector; then X is a \(D\times n\) matrix composed of the n observed feature values of each sample. The feature set is randomly divided into K equal subsets, where K is chosen as a suitable factor of n. Let L be the number of decision trees; the trees in the forest are denoted \(T_1,T_2,\ldots ,T_L\). The algorithm proceeds as follows.

(1) Select a suitable parameter K that is a factor of n, and randomly divide F into K disjoint subsets, each containing \(N = n/K\) features.

(2) From the training data set X, select the columns corresponding to the features in the subset \(F_{i,j} \) to form a new matrix \(X_{i,j} \). Then draw a bootstrap sample containing 3/4 of the objects of \(X_{i,j} \) to constitute a new training set \({X}^{\prime }_{i,j}\).

(3) Apply the linear feature transform to \({X}^{\prime }_{i,j}\) to produce the coefficients in a matrix \(P_{i,j} \), whose jth column contains the coefficients of the jth principal component.

(4) Arrange the coefficients obtained in the matrices \(P_{i,j} \) into a sparse rotation matrix \(S_i \), which is expressed as follows:

$$\begin{aligned} S_i =\left[ \begin{array}{cccc} f_{i,1}^{(1)},\ldots , f_{i,1}^{({N_1})} & 0 & \cdots & 0 \\ 0 & f_{i,2}^{(1)},\ldots ,f_{i,2}^{({N_2})} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f_{i,K}^{(1)},\ldots ,f_{i,K}^{({N_K})} \\ \end{array}\right] \end{aligned}$$
(11)

In the prediction phase, given a test sample x, the classifier \(T_i \) produces \(d_{i,j} \left( {xS_i^f} \right) \), the probability that x belongs to class \(y_j \). The confidence for each class is then calculated by average combination, as follows:

$$\begin{aligned} \alpha _j (x)=\frac{1}{L}\sum \limits _{i=1}^L d_{i,j} \left( {xS_i^f} \right) \end{aligned}$$
(12)

Then x is assigned to the class with the largest \(\alpha _j (x)\) value.
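As a concrete reference, the following is a minimal Python sketch of this original rotation forest procedure, built on scikit-learn's PCA and decision trees. It is an illustrative implementation under simplifying assumptions (a dense float feature matrix and enough bootstrap samples per subset for a full PCA), not the exact code used in our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

class RotationForest:
    """Minimal rotation forest sketch after Rodriguez and Kuncheva (2006)."""

    def __init__(self, n_trees=53, K=3, sample_frac=0.75, seed=0):
        self.n_trees, self.K, self.sample_frac = n_trees, K, sample_frac
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        n = X.shape[1]
        self.rotations_, self.trees_ = [], []
        for _ in range(self.n_trees):
            # step (1): random disjoint feature subsets
            subsets = np.array_split(self.rng.permutation(n), self.K)
            R = np.zeros((n, n))
            for cols in subsets:
                # step (2): bootstrap 3/4 of the objects
                idx = self.rng.choice(len(X), int(self.sample_frac * len(X)),
                                      replace=True)
                # step (3): PCA on the feature subset, keep all components
                pca = PCA().fit(X[np.ix_(idx, cols)])
                # step (4): place the loadings in the block-diagonal rotation
                R[np.ix_(cols, cols)] = pca.components_.T
            self.rotations_.append(R)
            self.trees_.append(DecisionTreeClassifier().fit(X @ R, y))
        return self

    def predict_proba(self, X):
        # average combination over the ensemble, cf. Eq. (12)
        return np.mean([t.predict_proba(X @ R)
                        for t, R in zip(self.trees_, self.rotations_)], axis=0)

    def predict(self, X):
        return self.trees_[0].classes_[np.argmax(self.predict_proba(X), axis=1)]
```

The default parameters n_trees=53 and K=3 anticipate the values selected in Sect. 4.2.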

3.2 Improved rotation forest with weighted feature selection

With the rapid development of technology, we obtain more and more features, and the feature dimension grows ever higher. High-dimensional data contain more information and reflect the facts more comprehensively. However, as the dimension increases, redundant information also increases. To analyze and integrate this information effectively and extract the useful part from a large amount of noisy data, we combined feature selection techniques with the rotation forest algorithm. We use the \({\chi }^{2}\) statistic to calculate the weight of each feature. The weight of a feature F against the class variable is computed using the following formula:

$$\begin{aligned} \chi ^{2}=\sum \limits _{i=1}^n \sum \limits _{j=1}^2 \frac{\left( {u_{ij} -v_{ij}} \right) ^{2}}{v_{ij}} \end{aligned}$$
(13)

where n is the number of distinct values of feature F, and \(u_{ij} \) is the number of samples for which feature F takes the value \(\lambda _i \) and the class is \(c_j \), defined as:

$$\begin{aligned} u_{ij} =\hbox {count}\left( {F=\lambda _i \; \hbox {and}\; C=c_j} \right) \end{aligned}$$
(14)

\(v_{ij} \) is the expected count for \(\lambda _i \) and \(c_j \), defined as:

$$\begin{aligned} v_{ij} =\frac{\hbox {count}\left( {F=\lambda _i} \right) \times \hbox {count}\left( {C=c_j} \right) }{N} \end{aligned}$$
(15)

where \(\hbox {count}\left( {F=\lambda _i} \right) \) is the number of samples whose feature F takes the value \(\lambda _i \), \(\hbox {count}\left( {C=c_j} \right) \) is the number of samples whose class is \(c_j\), and N is the total number of samples in the training set.

In order to make full use of the useful information, we perform the following steps. First, the weight of each feature is calculated by Eq. (13); second, the features are sorted in descending order of weight; finally, the top features are selected from the full feature set according to a given feature selection rate r. After these steps, a new data set is constructed for the algorithm. Using the new data set not only effectively removes the noise, especially in high-dimensional data, but also reduces the computation time.
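A minimal sketch of this weighted selection step is given below. scikit-learn's chi2 scorer, which expects non-negative feature values (LPQ histograms satisfy this), is used here as a practical stand-in for the contingency-table formulation of Eqs. (13)-(15); the helper name select_by_chi2 is our own.

```python
import numpy as np
from sklearn.feature_selection import chi2

def select_by_chi2(X, y, rate=0.4):
    """Hedged sketch of the weighted feature selection step: score every
    feature with a chi-squared statistic, sort in descending order of
    weight, and keep the top fraction `rate`."""
    scores, _ = chi2(X, y)
    order = np.argsort(scores)[::-1]                  # descending by weight
    keep = order[: int(np.ceil(rate * X.shape[1]))]   # feature selection rate r
    return X[:, keep], keep

# A FWRF-style pipeline would then train the rotation forest sketch above
# on the reduced feature set, e.g.:
#   X_sel, idx = select_by_chi2(X_train, y_train, rate=0.4)
#   model = RotationForest(n_trees=53, K=3).fit(X_sel, y_train)
```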

Fig. 1 Influence of feature selection on classifier performance in H. pylori data set

Fig. 2 Influence of the value of feature selection rates on time cost in H. pylori data set

4 Results and discussion

4.1 Selection of the number of features

In the experiment, fivefold cross-validation is used to evaluate the performance of our method. The whole data set is randomly split into five equal-sized subsets. To ensure the independence of the data, we select one subset as the test set and the other four subsets as the training set. This is repeated five times, so that each subset is used for testing exactly once. In our method, the number of features is controlled by the feature selection rate r; different feature selection rates lead to different accuracies and computation times. When selecting this parameter, we take one subset as the test set and the rest as the training set. The value of r ranges from 0 to 1. To avoid losing too much information, we test \(r = 0.2, 0.3, \ldots , 1.0\). Figure 1 plots the influence of feature selection on classifier performance, and Fig. 2 plots its influence on time cost.

Figure 1 shows that the accuracy increases slowly with r when the feature selection rate is less than 0.3. The reason is that the number of features increases with r, so more information is provided to the classifier. The accuracy reaches its peak when the rate is 0.4, meaning that the selected feature set already contains most of the useful information. The accuracy then gradually decreases as the rate grows beyond 0.4, which indicates that high-dimensional data are likely to contain noise. It can be seen from the figure that \(r=0.4\) is the best setting for this experiment.

Figure 2 shows the influence of feature selection on time cost. It is obvious from the figure that as the feature selection rate increases, that is, as the number of features grows, the computation time continues to rise. This indicates that the dimension of the data has a large impact on computation time.
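Putting the pieces together, a hedged sketch of the fivefold evaluation loop in this section, reusing the select_by_chi2 and RotationForest sketches above, might read as follows; X and y denote the LPQ feature matrix and interaction labels, and the stratified splitting is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hedged fivefold cross-validation sketch around the FWRF pipeline sketches
# above; X (LPQ features) and y (interaction labels) are assumed to exist.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in skf.split(X, y):
    X_tr, idx = select_by_chi2(X[train_idx], y[train_idx], rate=0.4)
    model = RotationForest(n_trees=53, K=3).fit(X_tr, y[train_idx])
    preds = model.predict(X[test_idx][:, idx])  # apply the same feature subset
    accs.append(np.mean(preds == y[test_idx]))
print("mean accuracy:", np.mean(accs), "std:", np.std(accs))
```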

4.2 Select the parameters of the FWRF classifier

Another issue affecting the performance of our model is the number of decision trees L and the number of feature subsets K in the feature weighted rotation forest. Since a large number of trees and feature subsets leads to considerable computational cost and affects the accuracy of the improved algorithm, we searched for the most suitable parameters with the grid search method. Figure 3 shows the prediction results for different parameters. We can see that as L increases, the accuracy rises rapidly at first and then levels off, whereas there is no significant change in accuracy as K increases. Considering both the time cost and the accuracy of the algorithm, we finally select \(K=3\) and \(L=53\).
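A minimal sketch of this grid search, again reusing the RotationForest sketch, is shown below; the candidate ranges for K and L are illustrative assumptions, not the exact grid used here.

```python
import numpy as np

# Hedged grid-search sketch over K (feature subsets) and L (trees).
# Assumes X_train/y_train and a held-out X_val/y_val split already exist,
# and reuses the RotationForest sketch defined earlier.
best_acc, best_params = 0.0, None
for K in (2, 3, 4, 8):                      # illustrative candidate values
    for L in range(10, 101, 10):
        model = RotationForest(n_trees=L, K=K).fit(X_train, y_train)
        acc = np.mean(model.predict(X_val) == y_val)
        if acc > best_acc:
            best_acc, best_params = acc, (K, L)
print("best (K, L):", best_params, "accuracy:", best_acc)
```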

Fig. 3 Accuracy surface of the feature weighted rotation forest obtained when optimizing parameters K and L

Table 1 Fivefold cross-validation results obtained by using the proposed method on H. pylori data set
Fig. 4 ROC curves performed by the proposed method on H. pylori data set

Table 2 Fivefold cross-validation results obtained by using the original rotation forest classifier on H. pylori data set

4.3 Assessment of prediction ability

In the experiment, the evaluation criteria are the prediction accuracy (Accu.), sensitivity (Sen.), precision (Prec.) and Matthews correlation coefficient (MCC). The calculation formulas are listed below:

$$\begin{aligned}&\hbox {Accu.}=\frac{\hbox {TP}+\hbox {TN}}{\hbox {TP}+\hbox {TN}+\hbox {FP}+\hbox {FN}} \end{aligned}$$
(16)
$$\begin{aligned}&\hbox {Sen.}=\frac{\hbox {TP}}{\hbox {TP}+\hbox {FN}} \end{aligned}$$
(17)
$$\begin{aligned}&\hbox {Prec.}=\frac{\hbox {TP}}{\hbox {TP}+\hbox {FP}} \end{aligned}$$
(18)
$$\begin{aligned}&\hbox {MCC}=\frac{\hbox {TP}\times \hbox {TN}-\hbox {FP}\times \hbox {FN}}{\sqrt{\left( {\hbox {TP}+\hbox {FP}}\right) \left( {\hbox {TP}+\hbox {FN}} \right) \left( {\hbox {TN}+\hbox {FP}} \right) \left( {\hbox {TN}+\hbox {FN}} \right) }}\nonumber \\ \end{aligned}$$
(19)

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. In addition, the receiver operating characteristic (ROC) curve (Zweig and Campbell 1993) and the area under the ROC curve (AUC) are employed to visually show the performance of the classifier.
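These four criteria follow directly from the confusion-matrix counts, as the short sketch below shows, mirroring Eqs. (16)-(19); the counts in the usage line are made-up illustrative numbers.

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision and MCC from confusion-matrix
    counts, following Eqs. (16)-(19)."""
    accu = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    prec = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accu, sen, prec, mcc

# usage with made-up counts:
# accu, sen, prec, mcc = ppi_metrics(tp=530, tn=525, fp=45, fn=50)
```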

Our prediction results are shown in Table 1. The average accuracy, precision, sensitivity, MCC and AUC are 91.91, 92.51, 91.22, 83.84 and 91.60%, respectively, with standard deviations of 1.21, 1.83, 1.92, 2.44 and 0.55%. The ROC curves obtained on the H. pylori data set are shown in Fig. 4, where the X-axis depicts the false positive rate (FPR) and the Y-axis depicts the true positive rate (TPR).

4.4 Comparison with original RF and SVM

To evaluate the performance of our method, we compared it with the original rotation forest classifier and the state-of-the-art SVM classifier, using the same feature extraction method on the same data set. After optimization by grid search, the parameters K and L of the original rotation forest are set to 2 and 3, and the parameters c and g of the SVM are set to 0.08 and 22. Table 2 lists the experimental results of the original rotation forest classifier: the average accuracy, precision, sensitivity, MCC and AUC with their standard deviations are \(85.36 \pm 1.61\), \(85.29 \pm 2.00\), \(85.37 \pm 3.01\), \(70.72 \pm 3.26\) and \(85.61 \pm 2.67\)%, respectively. Table 3 lists the experimental results of the SVM classifier: the average accuracy, precision, sensitivity, MCC and AUC with their standard deviations are \(81.82 \pm 2.19\), \(82.82 \pm 4.15\), \(80.29 \pm 1.03\), \(63.71 \pm 4.48\) and \(88.83 \pm 1.94\)%, respectively.

Table 3 Fivefold cross-validation results obtained by using the SVM classifier on H. pylori data set
Fig. 5 ROC curves performed by the original rotation forest classifier on H. pylori data set

Fig. 6 ROC curves performed by the SVM classifier on H. pylori data set

Table 4 Performance comparison of different methods on H. pylori data set

From the tables it can be seen that our FWRF shows a significant improvement over the other two classifiers. Its average accuracy is 6.55% higher than that of the original RF and 10.09% higher than that of the SVM. This is because the feature set may contain a great deal of noisy information, which degrades classifier accuracy, so the original RF and SVM did not perform well on this kind of feature set. In addition, the processed feature set is likely to be smaller than the original one, which reduces the dimension of the data and thus the running time of the classifier, improving efficiency (Figs. 5, 6).

4.5 Comparison of the proposed method with other existing methods

Many methods have been proposed for predicting protein interactions, and good results have been obtained. In this section, we focus on the H. pylori data set to compare FWRF with other algorithms. Table 4 lists the average prediction results of six other methods on the H. pylori data set. The accuracy values obtained by these methods lie between 75.8 and 87.50%, and their average accuracy is 82.80%, which is 9.11% lower than ours. The average precision, sensitivity and MCC of these methods, at 83.79, 81.95 and 74.38% respectively, are likewise lower than those of our method.

4.6 Performance of the feature weighted rotation forest on Yeast data set

In order to further evaluate the performance of the feature weighted rotation forest classifier, we verified it on the Yeast data set. In this experiment, the features of the Yeast data set are also extracted by the LPQ algorithm. Figure 7 shows the experimental results for different selection ratios. From the figure we can see that the feature weighted rotation forest classifier performs well: the accuracy, precision, specificity and MCC increase with the selection ratio when it is below 90%, reach their peak at a selection ratio of 90%, and decrease when the ratio exceeds 90%. This experiment demonstrates that our improved algorithm is equally applicable to the Yeast data set. In addition, it suggests that the features extracted by the LPQ algorithm from the Yeast data set contain less noise than those from the H. pylori data set.

Fig. 7 Influence of feature selection on classifier performance in Yeast data set

5 Conclusion

In this paper, we propose a novel improved rotation forest algorithm based on a weighted feature selection strategy to predict the interactions among proteins. The improved algorithm takes into account that the high-dimensional PPI data may contain noise. To address this, we distinguish the importance of different features by weight and delete the features with small weights, that is, the noise features, according to the selection ratio. This strategy improves the accuracy of the classifier while reducing the dimension of the data and saving execution time. In the experiments, we verified its capability on the H. pylori and Yeast data sets and compared it with the original rotation forest classifier, the SVM classifier and other existing methods. The excellent experimental results show that our method is effective and efficient. In future research, we intend to apply the feature weighted rotation forest algorithm in more areas and look forward to good performance.