Keywords

1 Introduction

Protein complex is a basic structural unit that can cooperate with each other to complete the specific biological function [1, 2]. The protein-protein interaction (PPI) network [3] is composed of a number of complexes which are related to each other to perform certain functions.

In recent years, many methods have been proposed to predict protein complexes. These methods greatly promote the progress in the field of complexes prediction. According to the category of complex recognition algorithms, these algorithms can be divided into the following categories: recognition algorithm based on dense sub-graph, recognition algorithm based on hierarchical clustering, recognition algorithm based on core-attachment structure.

Based on the theory of dense sub-graph in PPI networks, lots of algorithms are proposed. In 2003, Spirin et al. [2] proposed the algorithm using the results of traversing the fully connected graphs to identify complexes. Bader et al. [4] proposed a method, called molecular complex detection (MCODE). Palla et al. [5] proposed clique percolation method (CPM) based on the close connection of sub-graph filtering algorithm. And the algorithm’s application software is developed, called CFinder [6]. In 2006, Altaf-UI-Amin et al. [7] proposed the DPClus which can get the overlapping complexes. In 2008, Li et al. [8] proposed IPCA algorithm, which is based on dense sub-graph to identify overlapping complexes. In 2009, Liu et al. [9] proposed clustering based on maximal cliques (CMC) algorithm, which can dig out the dense sub-graph in PPI networks.

Hierarchical clustering theory is also used to predict protein complexes. In 2002, Aivan and Newman proposed GN algorithm [10], which is used to partition the modules in complex networks. In 2004, Hartuv et al. [11] proposed highly connected sub-graph (HCS) algorithm. In 2009, Li et al. [12] proposed a fast hierarchical clustering algorithm based on the local variable and edge clustering coefficient, and redefined the protein complex. In 2012, Wang et al. [13] proposed OMIM algorithm to predict duplicate complexes in hierarchical clustering system.

The core-attachment structure of complex is a significant view to detect protein complexes. Leung et al. [14] designed CORE algorithm which calculates the p-value for all pairs of proteins to detect cores. Wu et al. [15] proposed COACH algorithm.

Recently, swarm intelligence algorithms have been successfully applied to the detection of complexes in PPI networks [20]. In addition, there are many other algorithms, such as Markov Clustering (MCL) [16, 17] algorithm, ClusterONE [18] algorithm, SPICi [19] algorithm and so on.

In this paper, we proposed a protein complex prediction algorithm based on core-attachment structure and ant colony optimization method, CA-ACO. First, we adopt the weighted matrix of the dynamic PPI network as the similarity matrix of the undirected graph. Second, we use the clustering coefficient value of every node to obtain the core proteins. We mark every neighbor node of core protein with unique label through picking and dropping principle of ACO. Third, filtering processes are carried out to obtain the clustering result.

2 Methods

2.1 The ACO Based Core-Attachment Design

Since the protein is not always active in the cell cycle, in order to construct a dynamic model, we integrate the static PPI data and gene expression data because gene expression level and protein expression level are consistent. If the gene expression data at a certain timestamp is better than a threshold, then it can be considered that the protein is active at this timestamp. By using three-sigma [29] principle, active threshold is set. At a certain timestamp, if two proteins are active and interactional, it can be considered that there is an edge between the two proteins at this time. As gene expression data has 12 timestamps, the static network is divided into 12 sub-graphs which correspond to 12 timestamps. Eventually, the dynamic PPI network is constructed. Figure 1 shows a process of dynamic PPI networks construction.

Fig. 1.
figure 1

An illustration example of dynamic PPI networks construction

Considering the organization of complexes, we combine the structure of core-attachment [14] with the principle of picking and dropping to predict complexes, and propose a novel algorithm, named CA-ACO. Figure 2 shows a part of the core-attachment formation design. There are two clusters which are in dotted circles of green and blue. Protein p and protein q are seed proteins. The connection between a protein and others is represented by a solid line or dotted line. The full line represents that two proteins belong to a cluster. Otherwise, they don’t belong to a cluster. Taking p’s neighbor protein a as an example, protein a does not belong to the cluster whose seed protein is protein p. Then, we should pick up a, and decide if protein a belongs to other cluster. Protein b is the neighbor protein of protein p and q. As the Fig. 2 shows, the protein b belongs to two clusters whose seed proteins are protein p and protein q. For protein c which is connected with protein p and protein q. Firstly, node c did not belong to cluster p and was picked up. Then, we have to decide its relationship with other seed nodes. There is a full line between c and q. Then, node c needs to be dropped out and put in the cluster whose seed node is protein q.

Fig. 2.
figure 2

A part of the core-attachment formation design (Color figure online)

2.2 Description of CA-ACO Algorithm

The process of CA-ACO algorithm can be divided into 3 steps: similarity matrix initialization, clustering and purification.

In the first step, the similarity matrix of the dynamic PPI network is composed of 12 sub networks’ weighted matrix. The greater the weight value is, the greater the similarity between the nodes is. The similarity matrix of PPI network initialization is shown in Eq. (1).

$$ S({\text{v}}_{i} ,{\text{v}}_{j} ) = \left\{ {\begin{array}{*{20}c} {s({\text{v}}_{i} ,{\text{v}}_{j} )} & {if(v_{i} ,v_{j} ) \in E} \\ 0 & {else} \\ \end{array} } \right. $$
(1)

Where s(v i , v j ) is weight value of edge (v i , v j ). It represents the strength of the interaction of proteins v i and protein v j in weighted PPI network.

In the second step, we can obtain protein complex set from dynamic PPI network. There are two main processes: seed protein selection and attachment formation. In seed protein selection process, we need to compute the clustering coefficient value [30] of every protein by Eq. (2).

$$ ccv({\text{v}}_{i} ) = \frac{{2 \times n_{i} }}{{|neigh({\text{v}}_{i} )| \times (|neigh({\text{v}}_{i} )| - 1)}} $$
(2)

where neigh(v i ) is neighbor nodes set of a node v i. A protein whose clustering coefficient value is greater than the threshold will be added to the core protein set. In attachment formation step, we need to access to neighbor protein set of every core protein. By carrying out the picking and dropping operation [21] of ACO, seed proteins’ neighbor proteins can be clustered. The probability of picking is calculated by Eq. (3).

$$ pp({\text{v}}{}_{j}) = (\frac{{k_{p} }}{{k_{p} + s({\text{v}}_{i} ,{\text{v}}_{j} )}})^{2} $$
(3)

where k p is a picking constant whose range of values is from 0 to 1, s(v i , v j ) is the similarity between the protein v j and the current core protein v i . In the operation of picking, the probability of picking is compared with a random probability. When the probability of picking is more than the random probability, the operation of picking is executed. Otherwise, the protein v j is labeled the complex whose seed protein is the protein v i . The probability of dropping is calculated by Eq. (4).

$$ pd({\text{v}}_{j} ) = \left\{ {\begin{array}{*{20}c} {2 \times s({\text{v}}_{i} ,{\text{v}}_{j} )} & {\begin{array}{*{20}c} {if} & {s\text{(}{\text{v}}_{i} \text{,}{\text{v}}_{j} \text{)} < k_{d} } \\ \end{array} } \\ 1 & {else} \\ \end{array} } \right. $$
(4)

where k d is a dropping constant whose range of values is 0 to 1, s(v i , v j ) is the similarity between the protein v j and the current core protein v i . The probability of dropping is compared with random probability in the operation of dropping. When pd(v j ) is more than a random probability, the operation of dropping is executed. Therefore, the protein v j is labeled the complex whose seed protein is the protein v i . Through the clustering process, we can get the initial clustering results where complexes have core-attachment structure.

In the third step, purification process is carried out. In the complex set, the complex with just one protein is deleted, and protein complex which has same proteins as others is removed. The protein complex set is obtained.

The pseudo-code of CA-ACO method is described as follows.

figure a

3 Experiments and Results

3.1 Experimental Dataset

In this paper, we adopt the PPI data of S.cerevisiae from DIP [24], MIPS [31] and Krogan database [32]. Dynamic PPI networks [3] at 12 timestamps correspond to 12 static PPI subnets. Different subnets have different size, as shown in Table 1.

Table 1. The number of proteins and interactions in each subnet of different PPI networks

In this paper, we use the standard known protein complex set, CYC2008 [25], which contains 408 complexes and 1,628 proteins. The biggest cluster has 81 proteins while the smallest cluster has 2 proteins in protein complexes of CYC2008.

3.2 Evaluation Criteria

The precision [24] indicates the proportion of the predicted protein complexes successfully matched by the standard protein complexes in the prediction of the complex. It can be defined as:

$$ precision = \frac{{N_{cp} }}{|P|} $$
(5)

where |P| represents the number of predicted protein complexes, and N cp indicates that the number of the predicted complexes successfully matched by the known protein complexes.

The recall [24] indicates the proportion of the known protein complexes successfully matched by the predicted complexes in the standard of the complex. It can be defined as:

$$ recall = \frac{{N_{cb} }}{|B|} $$
(6)

where |B| represents the number of known protein complexes, and N cb indicates that the number of the standard protein complexes successfully matched by the predicted protein complexes.

The f-measure [24] denotes the harmonic mean of precision and recall. It can be defined as:

$$ f - measure = \frac{2 \times precision \times recall}{precision + recall} $$
(7)

In order to further validate the biological significance of the predicted protein complexes, we need to carry out the functional enrichment analysis by using p-value [36] formulated as follows:

$$ p - value = \sum\limits_{i = m}^{n} {\frac{{\left( {\begin{array}{*{20}c} M \\ i \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {N - M} \\ {n - i} \\ \end{array} } \right)}}{{\left( {\begin{array}{*{20}c} {\text{N}} \\ {\text{n}} \\ \end{array} } \right)}}} $$
(8)

where N is the number of protein in the PPI network, M is the number of proteins in a GO term, n is the number of proteins which are annotated with the same GO term. Generally, the smaller the p-value of a protein complex is, the stronger biological significance the complex processes will be.

3.3 Comparison with Other Methods

To evaluate the performance of CA-ACO algorithm, we compare CA-ACO with CMC [9], MCODE [4], CFinder [6], ClusterONE [18], CORE [14], COACH [15], RNSC [25], DPClus [7], MCL [16, 17], ACO-MCL [26], HC-PIN [27], MOEPGA [28] and FOCA [20] in terms of precision, recall and f-measure in the DIP dataset. It is obvious that the precision value of our method is greater than other methods precision value. The recall values of CMC, MOEPGA and FOCA algorithms are superior to our method, which are 0.5900, 0.6000 and 0.6360 respectively. However, the f-measure value of our method is much higher than other typical algorithms’ f-measure value. Our method’s f-measure value is 0.6653. It indicates that the performance of CA-ACO algorithm is optimal. The above analysis can be shown in Fig. 3.

Fig. 3.
figure 3

Precision, recall, f-measure values of various algorithms on the DIP dataset

Moreover, we also compare our method with the following prediction methods: CSO [33], ClusterONE [18], COACH [15], CMC [9], HUNTER [34] and MCODE [4] in terms of precision, recall and f-measure in the MIPS and Krogran dataset. As shown in Fig. 4, our method achieves the highest f-measure of 0.6025, recall of 0.5524 and precision of 0.6665 in MIPS dataset. On the Fig. 5, our method achieves the highest f-measure of 0.5844, recall of 0.4347 and precision of 0.8920 in the Krogan dataset.

Fig. 4.
figure 4

Precision, recall, f-measure values of various algorithms on the MIPS dataset

Fig. 5.
figure 5

Precision, recall, f-measure values of various algorithms on the Krogan dataset

We use functional enrichment analysis to validate the biological significance of methods. We calculate the p-value of detected complexes whose size are greater than or equal to 3. A complex is considered significant when its p-value is less than 0.01.

Table 2 lists the number and percentage of the identified complexes whose p-value is in the range of <E−10, [E−10, E−5), [E−5, 0.01),  >=0.01. Table 2 shows the comparison of the functional enrichment of complexes detected by CA-ACO, MCL, CORE and ClusterONE on DIP, MIPS and Krogan datasets. As shown in Table 2, we can obtain the number of predicted protein complexes by various methods on different datasets. The percentage and the amount of the predicted protein complexes with p-value fall into corresponding intervals. The percentage of complexes whose p-value is greater than 0.01 in predicted complexes by CA-ACO algorithm is the smallest. So, most of the predicted protein complexes by CA-ACO are meaningful. These illustrate that our proposed algorithm is competent to identified significant protein complexes in dynamic PPI networks.

Table 2. Functional enrichment analysis of complexes detected on DIP, MIPS and Krogan dataset

4 Conclusions

Many of the current methods predicting protein complexes are running in a static PPI network, which ignoring the dynamic properties of the PPI network and the inherent organization of the protein complex. In this paper, we proposed a novel method for detecting protein complexes in dynamic protein interaction networks, CA-ACO, which is based on the core-attachment structure of protein complexes. We compare the performance of the CA-ACO algorithm with other state-of-the-art methods in DIP, MIPS and Krogan dataset. Experimental results show that CA-ACO algorithm is obviously superior to other methods. In addition, the shift from static PPI networks to dynamic PPI networks is important to analyze the biological significance of complexes identified from PPI networks. In the future, we will further optimize our algorithm to improve the efficiency of algorithm and the effect of biological research.