Keywords

1 Introduction

Biological functions are supported by interactions between cellular components, including proteins. Protein-protein interaction (PPI) networks comprise nodes, corresponding to proteins, and edges, denoting protein interactions. The identification of protein complexes is important to understand the operational principles of cell systems, from signaling to metabolism. Several high-throughput technologies [1] have emerged to study PPIs, resulting in the assembly of large-scale PPI networks for well-studied, model organisms. However, due to the design and steps involved in these technologies, the resulting PPIs contain a sizeable portion of false-positive and false-negative interactions, which affect the identification of protein complexes [2].

Computational approaches are often used in the identification of protein complexes. These computational approaches have been grouped based on several criteria: (i) supervised vs. unsupervised, (ii) using only PPI network or also integrate other data [3]. To facilitate the comparison of these approaches and evaluate the predicted complexes, several gold standards, e.g. EcoCyc for Escherichia coli [4], MIPS, SGD, and CYC2008 for yeast [5,6,7], and CORUM for H. sapiens [8], have been assembled.

While there has been considerable improvement in the performance of computational approaches for prediction of protein complexes, there still some key challenges that remain open. The existing approaches rely on the idea that protein complexes correspond to highly connected clusters in PPI networks. As a result, they are often not capable of simultaneously identifying both dense and sparse, as well as, small and large protein complexes [3]. These approaches also depend on multiple parameters, which renders it difficult to interpret the resulting protein complexes. In contrast, the recently proposed PC2P and GCC-v [9, 10] represent parameter-free algorithms that model protein complex as biclique spanned subgraphs [11]. As a result, they can identify sparse as well as dense protein complexes independent of size, since biclique spanned subgraphs include stars, bicliques, and cliques as special graph classes. Nevertheless, if an approach for protein complex prediction uses only a given PPI network as input, its performance will be affected by erroneous and missing interactions in PPI networks. To overcome the issue of missing PPIs, link prediction algorithms have been recently proposed [12]. Therefore, the graph clustering and link prediction algorithms can be used jointly to improve protein complex prediction.

Building on this idea, here we introduce a new approach, referred to as CUBCO, that models protein complexes biclique spanned subgraphs, identified via minimum cut (unlike the local approaches in PC2P and GCC-v), and integrates link prediction to add most probable missing PPIs and investigate their effect on the performance of predicting protein complexes.

2 Results

2.1 CUBCO Predicts Protein Complexes Using Min Cut

Let \(G=(V,E)\) be a graph with a set of nodes, \(V\), corresponding to proteins, a set of edges, \(E\), denoting PPIs. The simple graph denoted by \(\overline{G }=(V,\left\{\left(u,v\right)|\left(u,v\right)\notin E\right\})\) denotes the complement of \(G\). We formalize the concept of a protein complex by a biclique spanned graph, \(G=(V,E)\), whose node set can be partitioned into two subsets, \({V}_{1}(G)\) and \({V}_{2}(G)\), and its edge set has the edges in the bipartite clique on \({V}_{1}\left(G\right)\) and \({V}_{2}\left(G\right)\) as a subset. Intuitively, a biclique spanned subgraph can be seen as a bipartite clique to which additional edges have been added. A biclique spanned graph has two properties: (i) the distance between any two nodes is at most two; (ii) Its complement, \(\overline{G }\), is disconnected [13]. These properties provide a natural formalization of a network cluster based on connectedness, since the complement of a cluster defined this way is disconnected. Thereby, the goal is to partition the graph \(G\), \(C=\{{C}_{1},{C}_{2},\dots ,{C}_{k}\}\), such that each \({C}_{i}\) is a biclique spanned subgraph [11]. This can be obtained by utilizing not only local properties (e.g. second neighborhood or clustering coefficient [9, 10]), but also global properties of the graph.

The complement of graph, \(\overline{G }\), contains edges that are not present in the original graph \(G\). From biological perspective, the edges of \(\overline{G }\) include false-negative and true-negative PPIs. Several studies have predicted the missing edges in PPI networks based on different concepts, with the network-based prediction of PPIs that relies on network walks of length three yielding the best results [12]. This approach favors inclusion of edges between nodes that are connected with a higher number of walks of length three in a given graph \(G\). Here, we use the advantage of this approach, but rely on paths—instead of walks—of length three, to avoid effects of direct neighbors. Thereby, we assign weight to the edges of \(\overline{G }\) based on normalized number of paths of length three, given by:

$$w\left(u,v\right)=\sum_{i,j}\frac{{P}_{u,i}{P}_{i,j}{P}_{j,v}}{\sqrt{{d}_{i}{d}_{j}}},$$
(1)

where \({P}_{i,j}=1\) if nodes \(i\) and \(j\) are adjacent, and zero otherwise, and \({d}_{i}\) denotes the degree of node \(i\). Since one of the features of a biclique spanned subgraph is that its complement is disconnected, here we employ minimum cuts to discover biclique spanned subgraphs [14].

Given a graph \(G\), CUBCO predicts protein complexes in three steps (Fig. 1): (i) determine \(\overline{G }\), (ii) employ ideas from link prediction to weigh the edges in \(\overline{G }\) based on the degree-normalized number of path of length three between the end-nodes in \(G\) (Eq. (1), Algorithm 1), and (iii) iteratively identify minimum cuts in the edge-weighted graph \(\overline{G }\) (Algorithm 2), by using the Stoer-Wagner efficient, deterministic algorithm [15], until all resulting components are biclique spanned.

To render \(\overline{G }\) disconnected and obtain the biclique spanned subgraph, \({C}_{i}\), in \(G\), we have to remove nodes, rather than edges (since edges in \(\overline{G }\) are not in \(G\)). Here, we used the min-cut algorithm that yields a partition of two node subsets, \({S}_{1}\) and \({S}_{2}\), and a min-cut \({E}_{cut}=\left\{\left({u}_{i},{v}_{i}\right)\right|{u}_{i}\in {S}_{1}\,and\,{v}_{i}\in {S}_{2}, 1\le i\le k\}\) where \(k=|{E}_{cut}|\). Since, in practice, either \({S}_{1}\) or \({S}_{2}\) contain a single node, rather than considering all subsets for node removal, we consider the final biclique spanned subgraph to be either \({C}_{1}=\{({S}_{1}\cup {S}_{2})/{\bigcup }_{i=1}^{k}{u}_{i}\}\) or \({C}_{2}=\{({S}_{1}\cup {S}_{2})/{\bigcup }_{i=1}^{k}{v}_{i}\}\). The selection of the set \(C\) is guided by a score, Eq. (2), that shows the cohesiveness of the subgraph, \(G[C]\), induced by \(C\) in \(G:\)

$$s(C) = \frac{{\left| {E_{in} (G[C])} \right|}}{{\left| {E_{out} (G[C])} \right|}},$$
(2)

where \(|{E}_{in}(G[C])|\) is the number of edges inside the subgraph, and \(|{E}_{out}(G[C])|\) is the number of edges connecting the subgraph \(G[C]\) to the rest of the network. CUBCO then selects \({C}_{i}\) with the largest score and removes it from the graph \(\overline{G }\). The complexity of CUBCO is \(O(\frac{{n}^{2}}{d+1}\left(m+nlogn\right))\), where \(n\) is the number of nodes, \(m\) is the number of edges, and \(d\) is the minimum degree.

figure a
figure b

2.2 Comparative Performance of CUBCO Without Link Prediction

We used twelve performance measures (Suppl. Information, see GitHub link in Method) to compare the predicted protein complexes from 17 approaches, including CUBCO, with all combinations of PPI networks and protein complexes from two E. coli, two S. cerevisiae, and one H. sapiens gold standards. The performance measures include: maximum matching ratio (MMR), fraction match (FRM), separation (SEP), positive predictive value (PPV), Sensitivity (SN), accuracy (ACC), precision, recall, F-measure, precision+, recall+, and F-measure+, with ranges between 0 and 1 and larger values indicating better performance (see Supplementary Information). We also calculated a composite scores given by the sum of MMR, FRM, ACC, and F-measure, that has been used in comparative analysis of approaches for protein complex prediction [16].

Fig. 1.
figure 1

Illustration of min-cut biclique spanned subgraph identification in CUBCO. The weight for each edge in the complement of the graph \(G\) is calculated based on Eq. (1). The global min-cut of the graph \(\overline{G }\) is obtained by Stoer-Wagner’s algorithm. In this case, the algorithm separates node 6 from node-set {3, 4, 5} by removing nodes 2 and 1. Hence, the first cluster {3, 4, 5, 6} corresponds to a biclique spanned subgraph in \(G\). After removal of this subgraph from \(G\), the second cluster is given by {1, 2}.

For the combinations of Babu PPI networks and gold standards of E. coli, CUBCO resulted in the highest composite score (Fig. 2A, Tab. S3, Fig. S1, see GitHub link in Method). CUBCO exhibited composite score larger than half of the approaches for the combinations of Kong PPI network and gold standards of E. coli (Tab. S3, Fig. S1). For the combination of Gavin PPI network and the SGD gold standards in S. cerevisiae CUBCO exhibited the highest composite score in 62.5% of cases, preceded only by PC2P and GCC-v approaches that also represent protein complexes as biclique spanned subgraphs (Fig. 2B). In the remaining combinations of networks and gold standards CUBCO obtained a composite score higher than half of the compared approaches. More precisely, on average, the composite score of CUBCO is only 3.5% smaller than the composite score of the contenders ranked higher than CUBCO (Tab. S3, Fig. S1). In the case of H. sapiens, CUBCO exhibited the highest composite score among all other contenders for the PIPS PPI network and CORUM gold standard (Fig. 2C). However, for the combination of the STRING PPI network and CORUM, the composite score of CUBCO was on average 40.4% smaller than the contenders with a higher composite score (Tab. S3, Fig. S1).

Fig. 2.
figure 2

Comparative analysis of approaches for prediction of protein complexes across PPI networks of different organisms. PPI networks for three organisms are considered (A) E. coli, (B) yeast, and (C) human. The comparative analyses are conducted with respect to a composite score that is a sum of MMR, FRM, ACC, and F-measure (see Supplementary file, Evaluation metric). Seventeen approaches, ordered by the year of publication, are compared on three PPI networks-gold standard combination. CUBCO outperforms all other approaches based on the composite score except in yeast, where it ranked third.

2.3 Integrating Link Prediction to Improve Performance of CUBCO

Next, we aimed to integrate a link prediction approach with CUBCO to resolve issues due to missing of PPIs from experimental approaches. To this end, we applied a modification of the L3 approach from [12], and ranked the missing edges by the number of paths of length 3 to avoid effects of immediate neighbors (Eq. (1)). We then selected the first 500, 1000, 1500, 2000, and 2500 of the ranked missing edges, and added them to the original network (as false-negative edges). Finally, we applied CUBCO on the modified network to predict the protein complexes. We found that the performance measures, and thereby, the composite score, increased slightly for all combinations of PPI networks and gold standards. The exception is the combinations of PPI networks of E. coli and KroganCore PPI network of yeast with Metabolic and SGD complexes, respectively. After careful investigation of these PPI networks, we realized that before inserting the new edges, they were of larger density and transitivity in comparison to all combinations of PPI networks of E. coli and KroganCore with Ecocyc and CYC2008 complexes. The result showed that the highest composite score for half of the combinations of PPI networks and gold standards was obtained by adding the first 500 ranked missing edges. For the combination of KroganExt PPI network of yeast with SGD complexes, the composite score improved gradually by increasing the number of edges from 500 to 2500 (Fig. 3, Tabs. S4–6, Fig. S2). Overall, the finding implied that integrating of link prediction to graph clustering methods can enhance the predicted protein complexes.

Fig. 3.
figure 3

Composite score due to integration of link prediction by adding the most probable edges. To investigate the composite score of CUBCO with the integration of link prediction, we inserted the first 500, 1000, 1500, 2000, and 2500 edges to the original PPI networks. The composite score is calculated for the combination of PPI networks and the gold standard of (A) E. coli and Ecocyc, (B) S. cerevisiae and CYC2008, and (C) H. sapiens and CORUM, respectively. On average, the composite score increases slightly with the increment of inserted edges across all PPI networks. (D) The original composite score of CUBCO across all combinations of organisms was compared with the obtained average composite score from inserting considered number of edges.

2.4 Prediction of New Protein Complexes in Conserved PPI Networks

We also applied CUBCO to predict new protein complexes in conserved PPI networks across 13 plant species, termed Pan-Plant [17], and across nine animal species, termed Metazoan [18] PPI network. For both PPI networks, we considered the high-confidence interactions with a score greater than 0.5 and used the gold standard, which is included in their corresponding studies (Tab. S1).

By applying CUBCO on the Pan-Plant PPI network, the resulting clusters fully coincided with 47 out of 118 known protein complexes (39.83%), and we predicted 249 new protein complexes that share no protein with the gold standards used (Tab. S7). To support the hypothesis that the proteins participating in a protein complex are involved in similar molecular functions and participate in the same cellular component and biological process, we determined the semantic similarity of the new protein complexes [19]. To this end, we employed the GOSim R package [20] to compute the median semantic similarity for every pair of proteins in each new protein complex. As a result, 77 of the predicted protein complexes (30.92%) showed GO semantic similarity of molecular functions equal to 1, indicating highest similarity of proteins function in the predicted protein complexes (Tab. S8). Further, we evaluated the predicted protein complexes by analyzing the domain-domain interactions (DDI) of proteins. We focused on the PPI network of Arabidopsis thaliana, identified the domains for ~46% of proteins based on the Pfam database [21], and considered the network of high-confidence (gold and silver) DDIs [22]. We found DDI support for 27 predicted protein complexes, of which 11 showed GO semantic similarity of 1 (Tab. S8). Finally, we only considered the predicted complexes that share no proteins with complexes in the gold standard, to obtain new predictions. We identified 165 new protein complexes, of which 19 have DDI support (Tab. S9). For instance, we found that oxidative stress tolerance protein NQR (AT1G49670) forms a complex with AIM1 (AT4G29010) and ECHIA (AT4G16210). They share the same cellular compartment and are located in the peroxisome. One of the major functions of peroxisome is the Beta-oxidation of fatty acid [23], and AIM1 and ECHIA participate in this process, while NQR acts as a response to oxidative stress and co-regulated with Beta-oxidation genes [24]. Therefore, the predicted new complexes are of high quality and can form basis for future experimental confirmation.

Focusing on the Metazoan PPI network, CUBCO fully recovered 105 out of 981 known protein complexes (10.70%) and predicted 221 new protein complexes (Tab. S7). By conducting GO semantic similarity based on molecular function, 46 of the predicted protein complexes (20.81%) showed a value of 1 (Tab. S8). Based on the Pfam database, we then detected the domains for all proteins of the H. sapiens PPI network as part of the Metazoan network. We found that 88 of the predicted protein complexes have the DDI support, while only 15 of them also obtained GO semantic similarity of MF category equal to 1 (Tab. S8). Finally, we found only two new protein complexes that share no proteins from the gold standard, of which one is supported with DDI data (Tab. S9). The predicted complex with DDI support is the NXF1:NXT1 complex, which is also known as the TAP:p15 complex. This complex is a general mRNA nuclear export factor, and it is conserved from yeast to humans [25]. Altogether, the results indicate that CUBCO improves the performance of protein complex prediction and can find biologically meaningful protein complexes.

3 Method

3.1 Contending Approaches for Prediction of Protein Complexes

We compared the performance of CUBCO with sixteen other state-of-the-art methods, including: Markov Clustering (MCL), Molecular Complex Detection (MCODE), CFinder, Affinity Propagation (AP), Clustering-based on Maximal Cliques (CMC), Clustering with Overlapping Neighbourhood Extension (ClusterOne), PEWCC, Prorank+, Discovering Protein Complexes based on Neighbor Affinity and Dynamic Protein Interaction Network (DPC-NADPIN), Core&Peel, Inter Module Hub Removal Clustering (IMHRC), Protein Complexes from Coherent Partition (PC2P), and GCC-v (for references and software used, see Tab. S2). To facilitate fair comparison, the approaches are selected based on the public availability of their implementations and independence of any additional knowledge or data. Where applicable, we have used the default value of parameters.

3.2 PPI Networks and Gold Standards of Protein Complexes

We carried out all the experiments on PPI networks and gold standards of three model organisms: E. coli., S. cerevisiae, and H. sapiens. All the PPI networks are edge-weighted except one from E. coli. We used the two PPI networks of E. coli. from [26, 27], named by the first authors’ names (Babu and Cong). The two gold standards were given by manually curated protein complexes from Ecocyc [4] and protein complexes based on the genome-scale metabolic network of E. coli [28]. For yeast, we used: Collins [29], Krogan core (edge-weight ≥ 0.273), Krogan extended (edge-weight ≥ 0.101) [30], and Gavin [31] PPI networks. The two gold standards were retrieved from CYC2008 [7] and complexes derived from the Saccharomyces Genome Database (SGD) [6]. For H. sapeiens, the two PPI networks were obtained from STRING (edge-weight \(\ge\) 999) [32] and PIPS (edge-weight \(\ge\) 25) [33]. In addition, we employed CORUM as the gold standard for human protein complexes [8] (Tab. S1). The comparative analysis of CUBCO and all the other approaches was performed on an Intel(R) Xeon(R) CPU E5-2670 v2 with 2.50GHz. CUBCO is freely available on GitHub at https://github.com/SaraOmranian/CUBCO, which also includes the supplementary figures and tables.

4 Conclusion

We proposed a new approaches, called CUBCO, to predict protein complexes from PPI networks. Since we have shown that partitioning a PPI network into biclique spanned subgraphs shows the best prediction performance of protein complexes [9, 10], CUBCO adopted the same concept. However, unlike the existing approaches, it relies on min-cuts, as a global network property, to determine the partition into biclique spanned subgraphs. CUBCO also employs paths of length three to rank false-negative interactions and include them in the networks to boost prediction performance. Future work will inspect the effect on coupling other approach for link prediction with approaches for prediction of protein complexes. In addition, we plan to consider weighted node cuts, rendering CUBCO applicable with large-scale proteomics and gene expression data.