Introduction

Protein–protein interactions are responsible for carrying out biochemical activities in living systems (Ahmed et al. 2015; Marceau et al. 2013). Previous studies have validated that protein–protein interactions play critical roles in the life cycles of living cells, such as genetic material duplication, regulation of gene expression, cell signal transduction, metabolism, organism growth and reproduction, cell apoptosis, and cell necrosis (Fry 2015; Sharon and Sinz 2015). Therefore, the study of how protein–protein interactions form intermolecular regulatory networks, including genetic regulatory pathways, metabolism, and signal transduction pathway, is of great biological significance (Betel et al. 2007; Hall et al. 2007; Hu et al. 2011; Jia et al. 2015a; Skrabanek et al. 2008). This research will not only help in further understanding various biological processes and mechanisms from a systematic point of view (Gromiha et al. 2009; Yugandhar and Gromiha 2014a, b) but can also help identify new drug targets and pave the way for the development of new drugs (Ako-Adjei et al. 2015; Burgoyne and Jackson 2006; Russell and Aloy 2008).

In preliminary studies, the determination of whether two proteins would interact with each other and which residues were interactive, i.e., protein–protein interaction sites (PPIs), was mostly confined to biological experimental methods (Drewes and Bouwmeester 2003). However, these wet lab methods are difficult to apply to all living organisms because they are both time- and cost-consuming (Edwards et al. 2002; Friedrich et al. 2006). In addition, biological wet lab methods for identifying PPIs tend to involve risks of high false positive and false negative results (Ito et al. 2001; Von Mering et al. 2002). In recent years, researchers have investigated the possibility of utilizing computational approaches to rapidly and accurately predict PPIs on large-scale protein datasets (Ito et al. 2000). Over the past decades, a number of machine learning algorithms, such as neural networks (NNs) (Fariselli et al. 2002), support vector machines (SVMs) (Bradford and Westhead 2005; Wang et al. 2006; Yan et al. 2004), and random forests (RFs) (Jia et al. 2015c; Šikic et al. 2009), have been successfully applied to PPI prediction, and many PPI predictors have emerged (Dhole et al. 2014; Murakami and Mizuguchi 2010a; Ofran and Rost 2007; Porollo and Meller 2007; Singh et al. 2014).

Existing machine-learning-based PPI predictors take protein sequential features, structural features, or both as the inputs of prediction models. Because protein 3D structures can provide intuitive and effective clues, they are generally preferred for performing PPI prediction (Agrawal et al. 2014; Cukuroglu et al. 2014; Sudha et al. 2014). For example, Jones and Thornton (1997a, b) carefully analyzed a series of residue patches on the surface of protein 3D structures using six parameters (residue interface propensity, solvation potential, hydrophobicity, planarity, protrusion, and accessible surface area) and proposed a method for calculating the relative combined score of a surface patch for forming protein–protein interactions. Bradford et al. proposed a support vector machine (SVM) predictor based on surface patch analysis (Bradford and Westhead 2005) and then further improved the prediction performance using a Bayesian network (Bradford et al. 2006). Chen et al. (2012) constructed three-dimensional probability density maps of non-covalent interacting atoms on protein surfaces and then applied machine learning algorithms to learn the characteristic patterns of the probability density maps specific to PPIs. However, the number of known protein 3D structures is still considerably smaller than that of sequenced proteins in spite of great efforts made in determining protein structures, which significantly limits the applicability of structure-based PPI prediction (Murakami and Mizuguchi 2010a).

Recently, much attention has been paid to sequence-based PPI prediction, and some progress has been made (Bock and Gough 2001; Zhou and Shan 2001). A number of promising PPI predictors that utilize sequence-derived features and machine-learning algorithms have emerged (Jia et al. 2015a, b; Murakami and Mizuguchi 2010a; Ofran and Rost 2007) (Dhole et al. 2014; Murakami and Mizuguchi 2010a; Singh et al. 2014). Ofran and Rost developed a neural network method called ISIS (Ofran and Rost 2007) for predicting PPIs based on the predicted structural features and evolutionary information calculated from the sub-sequences of nine consecutive residues. Porollo and Meller developed a predictor named SPPIDER (Porollo and Meller 2007) using SVMs and neural networks based on relative solvent accessibility (RSA), and they claimed that the RSA feature possesses a better discrimination capability than that of evolutionary conservation, physicochemical characteristics, structure-derived features, and other previously considered features for performing PPI prediction. Murakami and Mizuguchi used kernel density estimation to construct a naïve Bayesian classifier named PSIVER (Murakami and Mizuguchi 2010a) with position-specific scoring matrices (PSSM) and predicted accessibility (PA) as feature sources. Recently, Dhole et al. implemented SPRINGS (Singh et al. 2014) and LORIS (Dhole et al. 2014) to identify PPIs by applying artificial neural networks and L1-regularized logistic regression, respectively, based on evolutionary conservation, predicted relative solvent accessibility and averaged cumulative hydropathy features.

Overall, significant achievements have been made in the prediction of PPIs. Nevertheless, the performance of PPI prediction is still far from satisfactory, and there is still room for further improvement. In addition, we carefully analyzed the existing machine-learning-based PPIs predictors and observed that the severe class imbalance phenomenon, in which the number of majority samples (non-interacting residues) significantly outnumbers that of minority samples (interacting residues), has not been well considered (Ofran and Rost 2007; Porollo and Meller 2007; Šikić et al. 2009), which may potentially deteriorate the performance of machine-learning-based PPI predictors.

Motivated by all these observations, we proposed a new machine-learning-based method for further improving the performance of sequence-based PPI prediction. The proposed method mainly consists of three steps: First, a novel data-cleaning procedure is developed to relieve the severity of class imbalance by removing those difficult marginal targets from the majority samples in the original training dataset using a machine-leaning model; second, based on the cleaned dataset, a machine-learning-based PPI prediction engine is trained; finally, a post-filtering procedure is applied to the results of the prediction engine to reduce false positive predictions. We performed stringent computer experiments on benchmark datasets with both cross-validation and independent validation tests, and the results demonstrated the feasibility and efficacy of the proposed method.

According to Chou’s 5-step rule (Chou 2011), which has been implemented in a series of recent publications (Chen et al. 2014; Jia et al. 2015b; Lin et al. 2014; Liu et al. 2015a; Xu et al. 2014), to establish a useful sequence-based statistical predictor for a biological system, the following five guidelines should be followed (Chou 2011): (a) construct or select a valid benchmark dataset to train and test the predictor; (b) formulate biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm (or engine) to perform the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (e) establish a user-friendly web server for the predictor that is accessible to the public. Below, we describe how to address these steps systematically.

Materials and Methods

Benchmark Datasets

In this study, we benchmarked the proposed method on three widely used datasets, denoted Dset186, Dtestset72, and PDBtestset164, to demonstrate the method’s feasibility and effectiveness. Among the three datasets, Dset186 was used as a training dataset and the remaining two, i.e., Dtestset72 and PDBtestset164, were used as independent validation datasets. Dset186 was previously constructed by Murakami and Mizuguchi and consists of 186 non-redundant (sequence identity <25 %), heterodimeric, non-transmembrane, and transient protein chains, which have been structurally resolved by X-ray crystallography with a resolution of ≤3.0 Å (Murakami and Mizuguchi 2010a). The interacting residue in the protein chains was defined as a residue that lost absolute solvent accessibility of <1.0 Å2 upon complex formation (Singh et al. 2014).

Dtestset72 (Murakami and Mizuguchi 2010b) consists of 72 non-redundant sequences that are non-overlapping with sequences in Dset186. Dtestset72 was constructed based on the protein–protein docking benchmark set version 3.0 (Hwang et al. 2008) using a homology reduction procedure: Any sequences displaying ≥25 % sequence identity over a 90 % overlap with any of the sequences in Dset186 were removed using BLASTClust (Altschul et al. 1997). The obtained Dtestset72 includes rigid body cases (27 protein complexes), medium cases (6 protein complexes), and difficult cases (3 protein complexes). In these cases, each protein complexes consists of two protein chains.

To further explore the prediction performance of PPI prediction models on newly annotated proteins, another independent validation dataset, denoted PDBtestset164, built by Singh et al. (2014) was also used. The PDBtestset164 dataset was obtained using newly annotated proteins from June 2010 to November 2013. The same filter used to obtain Dset186 and Dtestset72 was applied to create PDBtestset164 (Singh et al. 2014). PDBtestset164 consists of non-redundant 164 protein chains extracted from newly deposited proteins (from June 2010 to November 2013) in the Protein Data Bank (PDB) with the same filters that were applied to construct Dset186 and Dtestset72. The software PSAIA (Protein Structure and Interaction Analyzer) (Mihel et al. 2008) was used to identify interacting residues of the protein sequences in PDBtestset64. Because PDBtestset164 consists of new proteins released after the construction of Dset186, we used it as the second independent validation dataset to further evaluate the generalization capability of the proposed method. For details about PDBtestset164, please refer to (Dhole et al. 2014; Singh et al. 2014). Table 1 summarizes the statistics of the three benchmark datasets.

Table 1 Composition of the training dataset and the two independent validation datasets

Feature Extraction

To develop a machine-learning-based PPI predictor, a critical step was to represent each residue as a discriminative feature vector. In this study, three feature sources, i.e., position-specific scoring matrix, averaged cumulative hydropathy, and predicted relative solvent accessibility, that have been demonstrated to be useful for PPI prediction were used to construct the discriminative feature for each sample (i.e., a residue in a protein sequence).

Evolutionary information contained in a protein sequence has been demonstrated to be useful for many protein attributes prediction problems, including PPI prediction (Chen and Jeong 2009; Dhole et al. 2014; Murakami and Mizuguchi 2010a; Yan et al. 2003; Yu et al. 2013a). Position-specific scoring matrix (PSSM) obtained by multiple sequence alignment can partially provide the evolutionary information of a protein sequence (Yu et al. 2011). For a given protein sequence, we generated a corresponding L × 20 PSSM using PSI-BLAST (Schäffer et al. 2001) to search the Swiss-Prot database through three iterations, with 0.001 as the E value cutoff for multiple sequence alignment against the sequence (He et al. 2015; Xiao et al. 2015b), where L is the length of the protein sequence. The original PSSM of a protein sequence with L residues generated by PSI-BLAST, denoted as \( {\text{P}}_{{{\text{original\_pssm}}}} \), can be formulated as follows:

$$ {\text{P}}_{{{\text{original\_pssm}}}} = \left[ {\begin{array}{*{20}c} {p_{1,1} } & {p_{1,2} } & \cdots & {p_{1,20} } \\ {p_{2,1} } & {p_{2,2} } & \cdots & {p_{2,20} } \\ \vdots & \vdots & \vdots & \vdots \\ {p_{k,1} } & {p_{k,2} } & \cdots & {p_{{{\text{k}},20}} } \\ \vdots & \vdots & \vdots & \vdots \\ {p_{L,1} } & {p_{L,2} } & \cdots & {p_{L,20} } \\ \end{array} } \right], $$
(1)

where \( p_{k,j} \) represents the score of the residue k in the protein sequence being mutated to residue type j during the evolution process. A positive score indicates that the corresponding mutation occurs more frequently than expected by chance, whereas a negative score indicates the opposite. Note that here we use the numerical code 1, 2, …, 20 to represent the 20 native amino acid types according to the alphabetical order of their single character codes (Zou and Xiao 2015). After obtaining the original PSSM, we further normalized each element of \( {\text{P}}_{{{\text{original\_pssm}}}} \) to the range (0, 1) with the following logistic function:

$$ f(x) = \frac{1}{{1 + e^{ - x} }}, $$
(2)

where x is the score in the original PSSM.

A sliding window of size W was then applied to the normalized PSSM to extract feature vectors for each residue of the protein sequence. According to (Dhole et al. 2014), W = 9 is a better choice for performing PPI prediction. Therefore, we set W = 9 in this study. Accordingly, the dimensionality of the obtained PSSM feature vector for a residue is 9 × 20 = 180-D.

Researchers have found that protein–protein interaction interfaces are generally hydrophobic patches on the surfaces of proteins (Chothia and Janin 1975; Jones and Thornton 1995, 1997a). Hence, the hydropathy index should be beneficial to the identification of protein–protein interaction sites, which has been demonstrated by related studies (Dhole et al. 2014; Singh et al. 2014). In this study, the hydropathy index proposed by Kyte and Doolittle (1982) was used. The hydropathy index of a residue represents the hydrophobic or hydrophilic properties of a residue’s side chain. The hydropathy index of residue is an indicator of hydrophilic and hydrophobic properties (Gallet et al. 2000). We explored the hydropathy indices of a residue and its neighborhood to extract the residues averaged cumulative hydropathy (ACH) feature. More specifically, for a target residue, its five ACH indices corresponding to five windows of different sizes (i.e., sizes of 1, 3, 5, 7, and 9) centered on the residue were calculated using the Python codes provided by Dhole et al. (2014); the five ACH indices were then concatenated to form a 5-D feature vector for the target residue.

We extracted the predicted relative solvent accessibility (PRSA) features of residues with the SANN web server developed by Joo et al. (2012). The SANN web server is freely available at http://lee.kias.re.kr/~newton/sann/. For a given protein sequence, SANN predicts the discrete states (two or three states) and a continuous value of solvent accessibility for each residue in the sequence (Dhole et al. 2014). In this study, we used the predicted continuous value of solvent accessibility to encode each residue into a 1-D PRSA feature.

Finally, a residue can be encoded into a 186-D feature vector by serially combining its 180-D PSSM feature, 5-D ACH feature and 1-D PRSA feature.

Data-Cleaning Procedure

The purpose of data cleaning is to remove marginal targets that are difficult to classify from the majority samples to relieve the severity of class imbalance of the original training dataset. Figure 1 illustrates the workflow of the proposed machine-learning-based data-cleaning (DC) procedure.

Fig. 1
figure 1

Workflow of the proposed machine-learning-based data-cleaning procedure

Let \( S = \left\{ {s_{1} ,s_{2} , \cdots ,s_{i} , \cdots ,s_{N} } \right\} \) be the set of N protein sequences in the original dataset, where \( s_{i} \) is the ith sequence. We use ‘+’ and ‘−’ to represent the class labels of minority and majority residues (i.e., interactive and non-interactive residues), respectively.

The proposed DC procedure removes potential marginal targets from one protein sequence each time. For the i-th sequence \( s_{i} \in S \), the proposed DC procedure first trains a prediction engine, denoted \( M_{i} \), based on the samples from \( S - \left\{ {s_{i} } \right\} \) with a machine learning algorithm; then, the trained prediction engine \( M_{i} \) is used to predict the class labels of residues of sequence \( s_{i} \); each majority residue (labeled ‘−’) of \( s_{i} \) will then be screened to determine whether it is a marginal target (a majority residue is considered marginal if it is predicted to belong to the minority class by the prediction engine \( M_{i} \)); the cleaned sequence \( s_{i}^{\text{c}} \) is obtained by removing all the marginal targets from \( s_{i} \); this procedure will be repeated \( N \) times until all of the \( N \) sequences in \( S \) have been cleaned. The cleaned dataset is denoted \( S^{\text{c}} = \left\{ {s_{1}^{\text{c}} ,s_{2}^{\text{c}} , \cdots ,s_{i}^{\text{c}} , \cdots ,s_{N}^{\text{c}} } \right\} \).

In essence, any machine learning algorithm can be used to construct the prediction engine for data cleaning. In this study, the random forest (RF) (Breiman 2001) algorithm was used as an example to implement the proposed DC procedure.

We explain the rationality of the proposed DC procedure as follows. Clearly, for each sequence \( s_{i} \in S \), the prediction engine is trained on an imbalanced dataset \( S - \left\{ {s_{i} } \right\} \), where the number of majority samples is significantly larger than that of minority samples. In other words, the trained model \( M_{i} \) will be biased towards the majority class. Consequently, it is reasonable to consider a majority residue in \( s_{i} \) as a marginal target if it is still being predicted as minority under the majority-prone model \( M_{i} \).

We processed the original dataset Dset186 with the proposed DC procedure. It was observed that 13,450 majority residues were cleaned. The ratio between the number of majority samples and that of minority samples was decreased from 5.56 to 3.13, demonstrating that the severity of data imbalance was reduced.

To better understand the proposed DC procedure, we took a sequence (PDB ID: 1AY7_A) from Dset186 as an example to vividly illustrate the effect after data cleaning, as shown in Fig. 2. The images in Fig. 2 were generated using PyMOL (DeLano 2002).

Fig. 2
figure 2

Visualization of the effect of the proposed DC procedure for protein 1AY7_A. a and b are images of 3-D structures of 1AY7_A in sphere style and cartoon style, respectively, before the DC procedure; c and d are images of 3-D structures of 1AY7_A in sphere style and cartoon style, respectively, after the DC procedure. Interactive and non-interactive residues are highlighted in yellow and cyan, respectively. Marginal residues are highlighted in red. The images were generated using PyMOL (DeLano 2002) (Color figure online)

Figure 2a, b shows images of the 3-D structures of 1AY7_A in sphere style and cartoon style, respectively, before the DC procedure; on the other hand, Fig. 2c, d shows images of the 3-D structures of 1AY7_A in sphere style and cartoon style, respectively, after the DC procedure. In Fig. 2, interactive and non-interactive residues are highlighted in yellow and cyan, respectively. Note that residues highlighted in red are also non-interactive ones located inside the inner compartment of a protein. These non-interactive residues are also called non-surface residues. In this study, non-surface residue were determined by calculating their relative solvent accessibility (RSA) using NACCESS (Hubbard and Thornton 1993). A residue was considered to occur on the surface if its RSA was <5 % (Jones and Thornton 1997a, b).

From Fig. 2, two observations can be made: (1) These non-surface residues are spatially located inside the inner compartment of a protein and thus cannot interact with other proteins. On the other hand, these non-surface residues (highlighted in red) are locate close to those interactive residues (highlighted in yellow) and thus could have a negative effect on training a machine-learning prediction model with a clear classification boundary. (2) Parts of the non-interactive residues (highlighted in cyan) are also located close to those interactive residues (highlighted in yellow) and thus could have the same negative effect as the non-surface residues.

In this study, we referred to these non-interactive and non-surface residues, which are spatially located close to interactive residues and have a negative effect on training a machine-learning prediction model with a clear classification boundary, as marginal residues. As previously mentioned, on the one hand, removing these (or parts of) marginal residues can reduce the severity of data imbalance in the original training dataset; on the other hand, removing the residues can help construct a much more compact prediction model with a clear classification boundary. As an example, it was observed that 6 out of 9 non-surface residues (highlighted in red) and 5 out of 67 non-interactive residues (highlighted in cyan) were successfully removed after applying the proposed DC procedure on 1AY7_A, as shown in Fig. 2c, d.

It should be noted that there are many other methods for addressing highly imbalanced or skewed dataset, and several of which have been successfully applied to bioinformatics problems (Ertekin et al. 2007a; Ertekin et al. 2007b; Estabrooks et al. 2004; Hong et al. 2007; Hu et al. 2014; Kang and Cho 2006; Laurikkala 2001; Ting 2002; Wang and Japkowicz 2010; Wu and Chang 2005; Zhou and Liu 2010). For example, Liu et al. (2015c) developed a predictor called iDNA-Methyl for identifying DNA methylation sites with a “neighborhood cleaning rule” to address class imbalance; Xiao et al. (2015a) developed iDrug-Target, which can predict the interactions between drug compounds and target proteins using a “synthetic minority over-sampling technique”.

Training a PPI Prediction Model on the Cleaned Dataset

Based on the cleaned dataset, we can train a PPI prediction model using any suitable machine learning algorithms. For consistency with the algorithm used in the section titled ‘Data-Cleaning Procedure’ and considering the fact that random forest algorithms have been demonstrated to be particularly useful for performing PPI predictions (Jia et al. 2015c; Šikić et al. 2009), we also used a random forest algorithm as an engine for constructing a PPI prediction model.

Taking Dset186 as an example, we first obtained a cleaned dataset, denoted Dset186c, with the proposed DC procedure. As calculated in the section titled ‘Data-Cleaning Procedure’, the ratio of the number of majority samples to that of minority samples was decreased from 5.56 to 3.13. Nevertheless, a data imbalance still exists. Therefore, we further used a random under-sampling technique to balance the majority and minority samples in the cleaned training dataset, i.e., Dset186c. In this case, the ratio of the number of majority samples to that of minority samples was set to 1:1. Finally, we could train a random forest algorithm based on the balanced training dataset with all residues represented by the feature vector we developed in the section titled ‘Feature Extraction’.

During the prediction stage, for each residue in an unseen protein sequence, the trained prediction model first predicts its possibility of being interactive; then, a prescribed threshold T is applied to determine whether it is an interactive residue: a residue with a possibility of being larger than T will be predicted as interactive. The value of T is optimized by choosing the one that maximizes the value of Matthews correlation coefficients (MCC) of predictions on Dset186 over leave-one-out cross-validation.

Post-Filtering Procedure

To further improve the PPI prediction performance, a post-filtering (PF) procedure is applied to the initial predictions to reduce potential predicted false positives. This PF procedure is motivated by the following observations made in previous studies (Ofran and Rost 2003; Yan et al. 2004): From the distribution of the number of interactive residues in a window of consecutive residues centered on an interactive residue, approximately 98 % of the observed interactive residues were observed to have at least one additional interactive residue and approximately 76 % had at least four interactive residues in a window of nine consecutive residues (within four residues on either side) (Murakami and Mizuguchi 2010a; Ofran and Rost 2003; Yan et al. 2004). These observations indicate that interactive residues tend to form clusters in sequences (Ofran and Rost 2003; Yan et al. 2004), and neighboring residues of an actual interactive residue have a high potential of being interactive residues (Murakami and Mizuguchi 2010a). Thus, an isolated interactive residue predicted by a model may potentially be a false positive. Therefore, we use a window-based post-filtering (PF) procedure to eliminate those isolated predicted interactive residues to reduce potential false positives.

Figure 2 illustrates the workflow of the proposed PF procedure. More specifically, for each interactive residue predicted by the model, we place a window of size W centered on the residue; then, we calculate the number of predicted interactive residues in the window; if this number is less than m, the prediction is considered a false positive. In this study, we optimized the values of W and m by varying W from 3 to 11 and m from 1 to 5 (Murakami and Mizuguchi 2010a; Ofran and Rost 2003; Yan et al. 2004). In this study, the optimized values of W and m were set to 11 and 3, respectively (Fig. 3).

Fig. 3
figure 3

Workflow of the proposed post-filtering procedure

We acknowledge that the proposed post-filtering procedure may also reassign a true positive as a false negative. Nevertheless, the following experimental results statistically demonstrate that the overall prediction performance can be further improved by incorporating the proposed post-filtering procedure.

Evaluation Indices

Six routine evaluation indices, i.e., Recall, Precision, Specificity, Accuracy, and F-measure, were used to evaluate the prediction performance and are defined as follows:

$$ {\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} $$
(3)
$$ {\text{Specificity}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} $$
(4)
$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(5)
$$ {\text{MCC}} = \frac{{{\text{TP}} \cdot {\text{TN}} - {\text{FP}} \cdot {\text{FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FP}}} \right) \cdot \left( {{\text{TP}} + {\text{FN}}} \right) \cdot \left( {{\text{TN}} + {\text{FP}}} \right) \cdot \left( {{\text{TN}} + {\text{FN}}} \right)} }} $$
(6)
$$ {\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}} $$
(7)
$$ F{\text{-measure = 2}} \times \frac{{{\text{Recall}} \times {\text{Precision}}}}{{{\text{Recall}} + {\text{Precision}}}} \, $$
(8)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.

Although the four abovementioned metrics (Eqs. 36) have often been used in the literature to measure the prediction quality of a prediction method, they are no longer the best ones because they lack intuitiveness and are not easy to understand for most biologists, particularly the MCC (the Matthews correlation coefficient). For clarity, we adopt an additional four metrics proposed by Chou (Chou 2001; He et al. 2015; Lin et al. 2014; Liu et al. 2014; Guo et al. 2014; Chen et al. 2013):

$$ {\text{Recall}} = 1 - \frac{{N_{ - }^{ + } }}{{N^{ + } }} $$
(9)
$$ {\text{Specificity}} = 1 - \frac{{N_{ + }^{ - } }}{{N^{ - } }} $$
(10)
$$ {\text{Accuracy}} = 1 - \frac{{N_{ - }^{ + } + N_{ + }^{ - } }}{{N^{ + } + N^{ - } }} $$
(11)
$$ {\text{MCC}} = \frac{{1 - \left( {\frac{{N_{ - }^{ + } }}{{N^{ + } }} + \frac{{N_{ + }^{ - } }}{{N^{ - } }}} \right)}}{{\sqrt {\left( {1 + \frac{{N_{ + }^{-} \,-\, N_{ - }^{ + } }}{{N^{ + } }}} \right) \cdot \left( {1 + \frac{{N_{ - }^{ + } \,-\, N_{ + }^{ - } }}{{N^{ - } }}} \right)} }}, $$
(12)

where \( N_{{}}^{ + } \) represents the total number of the interacting residues investigated, whereas \( N_{ - }^{ + } \) is the number of interacting residues incorrectly predicted as non-interacting residues; \( N_{{}}^{ - } \) represents the total number of non-interacting residues investigated, whereas \( N_{ + }^{ - } \) is the number of non-interacting residues incorrectly predicted as interacting residues. We can find the relations between Eqs. 36 and Eqs. 912 as follows (Chou 2001; Lin et al. 2014; Guo et al. 2014):

$$ \left\{ \begin{aligned} {\text{TP}} &= N^{ + } - N_{ - }^{ + } \hfill \\ {\text{TN}} &= N^{ - } - N_{ + }^{ - } \hfill \\ {\text{FP}} &= N_{ + }^{ - } \hfill \\ {\text{FN}} &= N_{ - }^{ + } \hfill \\ \end{aligned} \right. $$
(13)

For a detailed understanding of Eqs. 912 and 13), please refer to (Chou 2001; He et al. 2015; Lin et al. 2014; Liu et al. 2014; Guo et al. 2014; Chen et al. 2013). Please note that the set of metrics defined in Eqs. 912 is valid only for single-label systems. For multi-label systems, which have become more common in systems biology (Lin et al. 2013) and systems medicine (Xiao et al. 2013), a completely different set of metrics, as defined by (Chou 2013), is needed.

On the one hand, the six evaluation indices (Eqs. 38) defined above are threshold dependent, i.e., their values will depend on the threshold chosen for a given prediction model. On the other hand, PPI prediction is a typical imbalanced binary prediction problem; hence, over-pursuing Accuracy is not appropriate (He and Garcia 2009a). Considering that the MCC index provides an overall measurement of the quality of binary predictions (Baldi et al. 2000), we reported these indices for a prediction model with a threshold that maximizes the MCC value of predictions.

Results and Discussion

Effectiveness of Data-Cleaning and Post-filtering Procedures

In this section, we demonstrate the effectiveness of the proposed data-cleaning and post-filtering procedures for improving the prediction performance of PPI predictions.

First, we constructed a benchmark PPIs predictor using a random forest algorithm. To eliminate the severe imbalance in the training dataset, the random under-sampling technique (RUS) was used to balance the majority and minority samples in the training dataset. In this case, the ratio of the number of majority samples to that of minority samples was set to 1:1. For convenience, we termed this benchmark prediction model RF-RUS, which indicates a model trained with the RF algorithm and random under-sampling technique (RUS).

Second, we demonstrate the efficacy of the proposed data-cleaning procedure by incorporating it into the benchmark prediction model. More specifically, we first obtained a cleaned dataset by applying the proposed data-cleaning procedure on the original training dataset; then we performed the benchmark model, i.e., RF-RUS, on the cleaned training dataset. We denoted the RF-RUS procedure featuring a data-cleaning procedure as DC-RF-RUS.

Third, we incorporated the proposed post-filtering procedure into DC-RF-RUS; the resulting procedure was termed DC-RF-RUS-PF.

We performed stringent leave-one-out cross-validation tests on Dset186 for RF-RUS, DC-RF-RUS, and DC-RF-RUS-PF. Performance comparisons between the three methods are listed in Table 2. Please note that in each round of cross-validation for DC-RF-RUS and DC-RF-RUS-PF, only the sequences in the training subsets were cleaned and the sequence in the testing subset was not cleaned.

Table 2 Performance comparisons between RF-RUS, DC-RF-RUS, and DC-RF-RUS-PF on Dset186 over leave-one-out cross-validation

Table 2 shows that DC-RF-RUS outperforms RF-RUS with respect to all six considered evaluation indices except for Recall. DC-RF-RUS achieves a 1.7 % improvement on MCC, which is an overall measurement of the quality of binary predictions. The performance comparison between DC-RF-RUS and RF-RUS demonstrates that the proposed data-cleaning procedure improves the data quality, which will facilitate the training of a machine-learning-based PPI prediction model. We also find that the prediction performance is further improved, although not significantly, by incorporating a post-filtering procedure into DC-RF-RUS. We also find that the value of Specificity was increased by 2.8 %, which supports the argument we made that the proposed post-filtering procedure can help reduce the number of false positive predictions. We note that the value of Recall was reduced by 2.2 %; the underlying reason is that the post-filtering procedure mistakenly reassigned true positive predictions to false negatives. Nevertheless, the overall performances, e.g., Accuracy, F-measure, and MCC, were still improved by incorporating post-filtering, as shown in Table 2.

Comparisons with Existing PPIs Predictors Over Cross-Validation Test

In this section, we compare the proposed method, i.e., DC-RF-RUS-PF, with two of the most recently released sequence-based PPIs predictors on Dset186 over a stringent leave-one-out cross-validation test. The first predictor compared is LORIS (Dhole et al. 2014), which identifies PPIs from protein sequences using an L1-regularized logistic regression under the same feature set applied in this study; the second one is PSIVER (Murakami and Mizuguchi 2010b), which is also a sequence-based PPIs predictor but uses a naïve Bayesian classifier with position-specific scoring matrices (PSSM) and predicted accessibility (PA) as feature sources.

Table 3 summarizes the performance comparisons between the proposed method, LORIS, and PSIVER. The table shows that the proposed method significantly outperformed PSIVER with respect to all evaluation indices except for Specificity. Improvements of 7.8 and 2.9 % on MCC and F-measure, respectively, were achieved by the proposed method relative to the values measured for PSIVER. We also observed that the value of Specificity for PSIVER is 7.7 % higher than that of the proposed method. However, the value of Recall of PSIVER is only 41.6 %, which is 19.6 % lower than that of the proposed method, indicating that too many false negatives were incurred during the predictions of PSIVER. Regarding LORIS, which is a recently reported PPI predictor, the proposed method achieves comparable performance. Because the proposed method and LORIS used the same feature set, we can argue that the proposed DC procedure and post-filtering procedure are effective for PPI prediction.

Table 3 Performance comparisons between the proposed method, LORIS, and PSIVER on Dset186 over leave-one-out cross-validation

Comparisons with Existing PPIs Predictors Over Independent Validation Test

In this section, we explore the generalization capability of the proposed method, i.e., DC-RF-RUS-PF, by comparing it with existing PPIs predictors on two independent test datasets, i.e., Dtestset72 and PDBtestset164. In addition to LORIS (Dhole et al. 2014) and PSIVER (Murakami and Mizuguchi 2010b), several other existing sequence-based PPIs predictors, including SPRINGS (Singh et al. 2014), ISIS (Ofran and Rost 2007), and SPPIDER (Porollo and Meller 2007), were used.

Table 4 lists the performance comparisons between the proposed method and five other sequence-based PPIs predictors on the independent validation dataset Dtestset72. To fairly compare with previously developed predictors, for Dtestset72, we calculated the individual prediction performances for the 27 rigid body cases, the 6 medium cases, the 3 difficult cases (Murakami and Mizuguchi 2010b), and the overall averaged prediction performance on the entire dataset.

Table 4 Performance comparisons between the proposed method and other sequence-based PPIs predictors on the independent validation dataset Dtestset72

Table 4 shows that the proposed method outperformed five other PPIs predictors for each of the three cases. Regarding the overall prediction performance, average improvements of 2.7 and 1.2 % on MCC and F-measure, respectively, were achieved by the proposed DC-RF-RUS-PF compared with LORIS, which is the most recently released sequence-based PPIs predictor.

We acknowledge that ISIS exhibits the best performance in terms of Accuracy. However, under the class imbalance scenario, over-pursuing overall accuracy is not appropriate and can be deceiving in evaluating the performance of a predictor/classifier (He and Garcia 2009b; Yu et al. 2013b, c). As shown in Table 4, the Recall value of ISIS is only 35.0 %, which is the lowest of the six consider predictors. In other words, ISIS predicts too many false negatives, leading to a very small value of MCC, which is the overall measurement of the quality of binary predictions.

Table 5 lists the performance comparisons between the proposed method and other four sequence-based PPIs predictors on the independent validation dataset PDBtestset164. Table 5 shows that the DC-RF-RUS-PF method again achieved the best performance on PDBtestset164. Moreover, DC-RF-RUS-PF is significantly better than SPRINGS (Singh et al. 2014), PSIVER (Murakami and Mizuguchi 2010b), and SPPIDER (Porollo and Meller 2007). Compared with LORIS (Dhole et al. 2014), which outperformed other existing methods, DC-RF-RUS-PF also makes an improvement of 3.7 % on MCC and F-measure.

Table 5 Performance comparisons between the proposed method and other predictors on the independent validation dataset PDBtestset164

Clearly, the test results demonstrate that the generalization capability of the proposed method outperforms that of the previously reported methods. The good performance on an independent test further demonstrates the effectiveness of the proposed method for protein–protein interaction prediction.

Conclusions

In this study, we developed a DC-RF-RUS-PF algorithm for protein–protein interaction prediction. Experimental results obtained for benchmark datasets demonstrate the superiority of the proposed method over the existing PPI predictors. The good performance of the proposed method is derived from the use of the combined discriminative feature of protein residues and the powerful RF classification algorithm, particularly a data-cleaning procedure that can remove marginal targets and a post-processing procedure that can potentially reduce predicted false positives. Because PPI prediction is a typical imbalanced learning problem, our main focus is on cleaning ‘harmful’ non-interactive residues. In this respect, the aim is to eliminate classification bias in training models caused by class overlap attributed to class imbalance. The results of cross-validation and independent validation tests indicate the efficacy of the method. Our method is not limited to PPI prediction and can be applied to other bioinformatics problems in which serious class imbalance exists. As demonstrated in a series of recent publications (Chen et al. 2014, 2015; Ding et al. 2014; Jia et al. 2015b; Lin et al. 2014; Liu et al. 2015b, c; Guo et al. 2014; Chen et al. 2013), in developing new prediction methods, user-friendly and publicly accessible web servers will significantly enhance the effects of this imbalance (Chou 2015). Therefore, we will make efforts in our future work to provide a web server for the prediction method presented in this paper. Nevertheless, the benchmark datasets and the source code for the proposed method and have been made available at http://csbio.njust.edu.cn/bioinf/PPIS for free academic use.

Our future work will focus on further enhancing the accuracy with which protein–protein interaction sites are predicted by incorporating new discriminative features and powerful classification algorithms. Currently, the proposed method requires approximately 100 s for predicting a sequence with 300 residues. We will further optimize the method to improve the computational efficiency.