1 Introduction

In the post-genomic era, proteomics research has become one of the most important areas in the life science (Ji et al. 2014a), where the analysis of protein–protein interactions (PPI) is fundamental to the understanding of cellular organization, processes, and functions (Zhang 2009). Indeed, biologists also reveal that cellular functions and biochemical events are coordinately carried out by groups of interacting proteins in functional modules (Guimera and Amaral 2005). Thus, identifying functional modules in PPI data contributes greatly to the understanding of cellular functions and mechanisms.

Nowadays, rapid advance in experimental technologies has led to an explosive growth in binary PPI data (Chin and Zhu 2013), but it is getting hard to detect functional modules only using experimental approaches, which have several limitations, such as too many processing steps and too time-consuming (Li et al. 2010; Tarassov et al. 2008). Fortunately, computational approaches based on machine learning and data mining can overcome these shortcomings to some extent and have become useful complements to the experimental methods for detecting functional modules in PPI data. In general, these computational approaches first model PPI data as a network (graph) \(G = (V, E)\), where V is the set of nodes with each node represents a protein and E is the set of edges with each edge represents an interaction (link). Then various clustering technologies are used to divide the network and get corresponding protein module partitions. So far, there have been many kinds of computational methods which adopt different ideas and schemes to mine functional modules. All these methods can be systematically categorized into two main groups (Ji et al. 2014a, b): traditional graph–theoretic approaches and non-traditional graph–theoretic approaches. The traditional graph–theoretic approaches mainly make an analysis on the topology structure of a PPI network and employ some traditional clustering technologies to identify functional modules, such as density-based clustering (Bader and Hogue 2003; Adamcsek et al. 2006; Altaf-Ul-Amin et al. 2006), hierarchy-based clustering (Ravasz et al. 2002; Arnau et al. 2005; Aldecoa and Marín 2010) and partition-based clustering (King et al. 2004; Frey and Dueck 2007; Abdullah et al. 2009). The non-traditional graph–theoretic approaches refer to the new emerging methods in recent years, which make use of novel computational technologies to detect functional modules in a PPI network. This type of approaches mainly includes flow simulation-based methods (Dongen 2000; Cho et al. 2007; Feng et al. 2011), spectral clustering-based methods (Sen et al. 2006; Qin and Gao 2010; Inoue et al. 2010), core attachment-based approaches (Leung et al. 2009; Wu et al. 2009; Ma and Gao 2012), swarm intelligence-based approaches and so on. Among them, swarm intelligence technology is a population-based metaheuristic approach, which has been increasingly popular due to its ability in solving a variety of complex scientific and engineering problems (Hinchey et al. 2007). Such technology simulates the social behavior of certain living creature and has been applied in functional module detection in PPI networks. For example, Sallim et al. (2008) for the first time introduced ant colony optimization (ACO) into functional module detection by taking it as the optimization problem of solving traveling salesman problem (TSP). Ji et al. (2012a) proposed a new algorithm based on ACO (called as NACO-FMD), which incorporates known biological knowledge—Gene Ontology (GO), into the process of detecting functional modules. To further overcome the shortcoming of easily trapped into a local optima for ACO algorithm, they combined ACO with multi-agent evolutionary (MAE) to identify functional modules (Ji et al. 2012b, 2013). Subsequently, they again presented an algorithm based on ant colony clustering (called as ACC-FMD) for functional module detection in PPI networks (Ji et al. 2015). Moreover, by uniting with functional flow clustering, artificial bee colony (ABC) algorithm was likewise used to identify functional modules in PPI networks (Wu et al. 2011).

Bacterial foraging optimization (BFO) (Passino 2002) is another famous swarm intelligence algorithm developed by Passino in 2002, which simulates the foraging behavior of Escherichia coli bacteria. The basic principle is that bacteria move through either tumbling or swimming to maximize the energy consumed by eating as many nutrients as they can. As the smallest creatures on the earth, bacteria contain many clever optimization mechanisms. Thus, BFO which is inspired by bacterial biological behavior has unique good performance and has been successful in a wide variety of optimization tasks (Das et al. 2009), and also found its way in the domain of protein functional module detection (Lei et al. 2011, 2013). However, there are two basic issues in Lei et al. (2011, 2013): (1) Each bacterium represents one protein node, and this individual representation is unable to realize information exchange among bacterial individuals, but information exchange among different individuals is a crucial mechanism for a swarm intelligence optimization algorithm. (2) Although three key operators including chemotaxis, reproduction, and elimination and dispersal are utilized, they are too simple to be in conformity with the biological mechanisms simulated in the original BFO algorithm. Thus, the above two basic issues make it difficult to fully release the potential of BFO algorithm for functional module detection in PPI networks. To explore the full potential of BFO algorithm in identifying functional modules, this paper conducts a further and thorough investigation and proposes a new bacterial foraging optimization algorithm for functional module detection in PPI networks, called BFO-FMD. In comparison with the previous work in Lei et al. (2011, 2013), the main contributions of this paper are summarized as follows:

  • The proposed algorithm employs a completely different representation of bacterial individual. Each bacterial individual represents a candidate module partition, which is easy to exchange information among different bacterial individuals.

  • In the proposed algorithm, the realization of four optimization mechanisms more faithfully follows the biological mechanisms of bacteria.

  • Systematic experiments have been conducted to compare the proposed algorithm with different kinds of state-of-the-art algorithms on three yeast PPI datasets using seven evaluation metrics.

2 BFO-FMD algorithm

2.1 Basic idea

By considering functional module detection as an optimization problem, this study uses BFO algorithm to solve the optimization problem and proposes a new algorithm called BFO-FMD. In BFO-FMD, each bacterial individual corresponds to a candidate module partition (i.e., a candidate solution), and is evaluated using the modularity metric, which is commonly used in the partition of a network into modules (Guimera and Amaral 2005), and is given below:

$$\begin{aligned} J(\theta ) = \sum \limits _{i = 1}^{M}\left[ \frac{e_{i}}{|E|}- \left( \frac{d_{i}}{2|E|}\right) ^{2}\right] , \end{aligned}$$
(1)

where \(\theta \) is a candidate module partition represented by a bacterium, M is the number of modules for a bacterium \(\theta \), \(e_{i}\) is the number of links between nodes in the ith module, \(d_{i}\) is the sum of the degrees of nodes in the ith module, and |E| is the number of all the links in the PPI network. In general, the modularity metric tends to favor the detected results which comprise many within-module links and as few as possible between-module links. The higher the modularity score, the better the detected result (i.e., the better the corresponding individual solution).

The procedure of BFO-FMD includes three stages: solution initialization, solution optimization, and post-processing. At first, BFO-FMD constructs each bacterial individual solution using a random-walk behavior. Then, it optimizes iteratively each individual solution using four biological mechanisms (chemotaxis, conjugation, reproduction, and elimination and dispersal) to look for superior solutions with higher modularity scores. At last, BFO-FMD carries out two post-processing steps to further refine the preliminary module partition detected. In the following, we will explain key parts of this algorithm in detail.

2.2 Solution representation and initialization

For swarm intelligence-based algorithms, it is essential to design appropriate solution representation and the initialization method for different problems. To address the problem of detecting functional modules in PPI networks, BFO-FMD makes each bacterial individual correspond to a candidate module partition and encodes it as a graph with N directed edges: \(\theta = \{\ (1\rightarrow a_{1}), (2\rightarrow a_{2}),\ldots , (i\rightarrow a_{i}), \ldots , (N\rightarrow a_{N})\}\), where i is a node label, \(a_{i}\) denotes the node label that the node i points to, and N is the number of nodes in a PPI network, i.e., \(N = |V|\). In the initialization process, a bacterium first randomly selects a starting node and then continuously makes use of a random-walk behavior to traverse other nodes in the PPI network. At each step, the bacterium is on a node, tries to move to a similar node which is functionally correlated or structurally similar to the current node, and then builds a link. Clearly, the greater the similarity between two nodes of a link, the stronger the connection strength. When there is no any satisfied node, the bacterium ends its current pathway by pointing to itself, picks up another untraversed node in a PPI network, and begins to a new traversal. This random-walk behavior continues until all the N nodes in a PPI network are traversed. During the initialization, each bacterium keeps on walking to a random similarity node with its current node step-by-step. For node i, its similarity node is selected randomly from the following set:

$$\begin{aligned} \varOmega _{i} = \left\{ j | \frac{(s_{ij} + f_{ij})}{2}>\varepsilon \right\} , \end{aligned}$$
(2)

where \(\varepsilon \) represents an overall similarity threshold, \(s_{ij}\) and \(f_{ij}\) denote the structural similarity and functional similarity between nodes i and j, respectively.

Fig. 1
figure 1

Bacterial individual representation and initialization process: a PPI network, b initialization process, c individual representation, and d an initial solution represented by a bacterial individual

Given two protein nodes i and j, the structural similarity formula is given as follows (Mete et al. 2008):

$$\begin{aligned} s_{ij}=\frac{|\varGamma (i)\bigcap \varGamma (j)|}{\sqrt{|\varGamma (i)||\varGamma (j)|}}, \end{aligned}$$
(3)

where \(\varGamma (i)\) is a set of the neighborhood nodes of node i plus itself, and \(|\varGamma (i)|\) is the size of the set.

Based on the annotation information of Gene Ontology (GO), the functional similarity formula is expressed as follows (Schlicker and Albrecht 2008):

$$\begin{aligned} f_{ij} = \frac{|g^{i} \bigcap g^{j}|}{|g^{i} \bigcup g^{j}|}, \end{aligned}$$
(4)

where \(g^{i}\) and \(g^{j}\) are GO term sets of nodes i and j, respectively.

To clearly show the solution representation and initialization process, Fig. 1 illustrates an example in a simplified PPI network with ten nodes. Figure 1a shows a PPI network containing ten nodes marked from 0 to 9, Fig. 1b gives the generation process of a bacterium denoting an individual solution, where this bacterium first picks out node 3 randomly, then it moves to node 4 which is selected out from the set in Eq. (2) at random and builds a link \(3 \rightarrow 4\). Next, it moves to node 8 from node 4 and builds a link \(4\rightarrow 8\) in the same way. This process is repeated until it reaches node 0 when there is not any node meeting the condition in Eq. (2), and the bacterium has to end its current pathway by pointing to itself \(0 \rightarrow 0\). Afterward, this bacterium randomly chooses another unvisited node 2 and restarts a new pathway until all the ten nodes are traversed. Figure 1c gives the solution presentation by adjusting the sequence in Fig. 1b according to the node label, and Fig. 1d shows the detected modules corresponding to the individual solution shown in Fig. 1c. It is obvious, such initialization method does not need a given number of modules in advance, and it automatically makes decision according to the links of a bacterial individual.

2.3 Chemotaxis

This mechanism simulates the foraging movement of E.colis through tumbling and swimming. To absorb more nutrients, each bacterium tries to find food in two ways: tumbling and swimming. A bacterium tumbles in a random direction to exploratively search for food. If the food is rich in the selected direction, the bacterium will swim along this direction, till the food gets bad or the bacterium has swum the fixed steps.

figure a

From the above description, a proper random direction generated through tumbling and an reasonable way of updating a solution are important for the realization of the chemotaxis mechanism. In BFO-FMD, a random direction is defined as a random masking vector with N dimensions, and the value of each dimension is either 0 or 1 with a certain probability \(P_\mathrm{ch}\). That is, for each dimension of a masking vector, if a uniform random number in the range of (0, 1) is smaller than \(P_\mathrm{ch}\), its value will be 1, otherwise it will be 0. Thus, when the bacterial population performs a chemotaxis step, each bacterium first generates an aforesaid masking vector denoted by D (equivalent to tumbling). Then, it updates its current solution according to the following method: select randomly a different bacterium and check the value of each dimension in D one by one. When the value is 1, the bacterium will compare the connection strength of the corresponding link with the selected bacterium. If its connection strength is weaker than that of the selected bacterium, the bacterium will replace its corresponding link with that of the selected bacterium. After the bacterium checks each component of D, it gets a new solution. The next step is to compute the modularity score of the new solution according to Eq. (1). If the modularity score of the new solution is higher than that of the old one, the bacterium will continue to optimize the new solution along the direction D in the same way (equivalent to swimming), till the modularity score of the new solution does not improve any more or the bacterium has carried out the maximum number of times \(N_{s}\). We summarize the chemotactic process of a bacterium described above into a function Chemotaxis_Process( i ), where i is a parameter showing that the ith bacterium performs the chemotactic process. For such chemotaxis mechanism, tumbling controls the direction of searching for better module partitions. Swimming is a driving force toward better module partitions when a bacterial individual finds a right direction. The chemotaxis mechanism is a complex and close combination of swimming and tumbling that keeps bacteria in these places with higher modularity scores for module partitions and plays a crucial role in searching for the module partition with the highest modularity scores.

Fig. 2
figure 2

Method of generating a new solution in chemotaxis operator

For clarity, Fig. 2 illustrates the method of generating a new solution in the chemotaxis operator, where \(\theta _{i}\) is a candidate solution represented by bacterium i which performs a chemotaxis operator, \(D=(0,0,1,0,0,1,0,1,0,0)\) denotes the chemotaxis direction, and \(\theta _{a}\) is another different solution represented by bacterium a. The 3rd, 6th, and 8th components are 1 in D, so bacterium i compares its 3rd link (\(2\rightarrow 3\)), 6th link (\(5\rightarrow 4\)), 8th link (\(7\rightarrow 6\)) with those of bacterium a (\(2\rightarrow 5\), \(5\rightarrow 7\) and \(7\rightarrow 8\)), respectively. Assume that the connection strength of 3rd link (\(2\rightarrow 3\)) and 8th link (\(7\rightarrow 6\)) in solution \(\theta _{i}\) is weaker than those of \(\theta _{a}\) (\(2\rightarrow 5\) and \(7\rightarrow 8\)), bacterium i replaces its 3rd link \(2\rightarrow 3\) and 8th link \(7\rightarrow 6\) with \(2\rightarrow 5\) and \(7\rightarrow 8\), respectively. Accordingly, bacterium i obtains a new solution.

2.4 Conjugation

Conjugation, as well as chemotaxis, is an important biological characteristic of bacteria. A bacterial conjugation is the transfer of part of plasmid (genetic material) from donor bacteria to recipient bacteria by a directly physical contact and is often regarded as the sexual reproduction or mating between bacteria. Some researchers have taken the bacterial conjugation as a message passaging mechanism in their work (Perales-Graván and Lahoz-Beltra 2008; Balasubramaniam and Lio 2013).

In this paper, we also simulate the bacterial conjugation behavior to play a part of information exchange between individuals in the proposed BFO-FMD algorithm. To model this biological behavior, we define conjugation probability \(P_{co}\) and conjugation length L (\(L < N\)) which decides the amount of changes to links of a module partition. When a bacterium carries out a conjugation step with a given probability \(P_{co}\), it first randomly selects another different bacterium and a conjugation point Bt (\(Bt<N-L\)), then the bacterium replaces its links with those of the selected bacterium from Btth dimension to \((Bt+L)\)th dimension. At last, the bacterium gets a new solution and will keep the superior one after comparing the new solution and the old solution. We summarize the conjugation process of a bacterium introduced above into a function Conjugation_Process( i ), where i is a parameter showing that the ith bacterium performs the conjugation process. With the new conjugation mechanism, the bacterial population has a strong information exchange channel. Each bacterium in the population no longer takes action blindly but searches for the module partition with the highest modularity score under the help of the other bacteria. At the same time, this new conjugation operator can also be considered as a kind of local optimization operator which is different from the chemotaxis operator. The operator improves each individual solution by changing some continuous links, while the chemotaxis operator enhances each solution by altering some discontinuous links. Although the two operators work through different mechanisms, they are supplement each other to look for the better partition with higher modularity score.

figure b

To be more clear, Fig. 3 displays the way of generating a new solution, where \(\theta _{i}\) is a candidate solution represented by bacterium i which performs a conjugation operator, \(\theta _{b}\) is another different candidate solution represented by bacterium b which is selected randomly from the population. \(Bt = 2\) is the conjugation point, and \(L=3\) is the conjugation length. Bacterium i, respectively, replaces three links \(2\rightarrow 5\), \(3\rightarrow 1\), and \(4\rightarrow 4\) in solution \(\theta _{i}\) with \(2\rightarrow 1\), \(3\rightarrow 4\), and \(4\rightarrow 8\) in solution \(\theta _{b}\). Accordingly, bacterium i obtains a new solution.

Fig. 3
figure 3

Method of generating a new solution in conjugation operator

2.5 Reproduction

The bacteria grow longer with the increase in absorption of nutrients. The more nutrients a bacterium gets, the healthier it is. Under appropriate conditions, some bacteria in a population which are healthy enough will asexually split into two bacteria, and the other ones will die. In the simulation of reproduction, the modularity score of a bacterium is used to measure its health degree, the higher the modularity score of a bacterium, the healthier it is. We define the number of chemotactic steps \(N_{c}\) to be the length of the lifetime of each bacterium, and \(S_{r}\), that is half of the population size S, be the number of bacteria who are healthy enough to make a copy of themselves. After \(N_{c}\) chemotactic steps, a reproduction step is taken. First, sort the bacterial population in a descending order on the basis of the modularity score. Each of the first \(S_{r}\) bacteria with higher modularity score split into exactly two same bacteria, and the other \(S_{r}\) bacteria will die, which maintains a constant size of the population. In essence, this mechanism represents a fairly abstract model of Darwinian evolution and biological genetics in genetic algorithms, which can pass the elite information among swarm individuals, and speed up the convergence.

2.6 Elimination and dispersal

When the local environment where a population of bacteria live changes, all the bacteria in this region may be killed or a group of bacteria may be dispersed into a new environment to find good food sources. To simulate this phenomenon, an elimination and dispersal step is evoked after \(N_\mathrm{re}\) reproduction steps. Each bacterium in the population may be eliminated and dispersed into a new location with a given probability \(P_\mathrm{ed}\). The rule is shown in the following:

$$\begin{aligned} \theta = \left\{ \begin{array}{ll} \theta _\mathrm{new}, &{}\quad \hbox {if}~q < P_\mathrm{ed}\\ \theta , &{}\quad \hbox {otherwise} \end{array}, \right. \end{aligned}$$
(5)

where q is a random number uniformly distributed in [0, 1], \(\theta \) is the current solution associated with a bacterium, \(\theta _\mathrm{new}\) is a new random solution generated by re-initialization using a random-walk behavior. That is, for each bacterium, if the number generated randomly is smaller than \(P_\mathrm{ed}\), it will move to a new random solution; otherwise, it will keep the original solution unchanged. This mechanism generates new random solutions for some bacteria and makes these bacteria search from new starting points. Essentially, this mechanism is quite similar to macromutation in the classical evolutionary algorithms, and is important in exploring new search regions over the whole search space, which helps the bacterial population to jump out of local optima.

2.7 Post-processing

When the iterative loops terminate, we obtain a bacterial individual with the largest modularity score. The bacterial individual can be decoded into a set of clusters, and each cluster corresponds to a protein module. Therefore, the preliminary modules are generated through the bacterial foraging optimization process. To further improve the detection quality, two post-processing steps are adopted to refine the preliminary modules.

The first step is to merge preliminary modules in light of functional similarity based on the annotation information of GO. A merging module comes from two preliminary modules which are close in function. Specifically, iteratively merge two modules with the largest functional similarity until there are no such two modules whose functional similarity is larger than the merging threshold \(\lambda \). The functional similarity between two modules \(h_{1}\) and \(h_{2}\) is defined as:

$$\begin{aligned} S(h_{1}, h_{2}) = \frac{\sum \nolimits _{i\in h_{1}, j\in h_{2}} S(i,j) }{\hbox {min} (|h_{1}|, |h_{2}|)}, \end{aligned}$$
(6)

where

$$\begin{aligned} S(i,j) = \left\{ \begin{array}{ll} 1 &{}{\quad } \hbox {if}~ i = j\\ f_{ij} &{} {\quad }\hbox {if}~ i \ne j, \hbox {and}~ (i,j)\in E\\ 0 &{} {\quad }\hbox {otherwise} \end{array}.\right. \end{aligned}$$
(7)

The second step is a simple filtering operator, and its goal is to exclude some sparsely connected modules from the topological structure perspective. To do this, first compute the density of each module and then remove the module whose density is smaller than the predetermined filtering threshold \(\delta \). The density of a module is given by (Li et al. 2010):

$$\begin{aligned} D_{h} = \frac{e_{h}}{n_{h} \cdot (n_{h} - 1) / 2}, \end{aligned}$$
(8)

where \(n_{h}\) is the number of nodes and \(e_{h}\) is the number of interactions in module h.

By the above two post-processing strategies, the preliminary modules are further refined in the view of functional similarity and topological structure, and the final functional modules are obtained.

figure c
Table 1 Datasets used in experiments

2.8 Algorithm description

According to the previously detailed introduction, the pseudocode of the proposed BFO-FMD algorithm is given in Algorithm 1. This algorithm first starts with an initial bacterial population, each of which represents a candidate module partition and is constructed by a random-walk behavior. Then, it executes a triple loop: chemotaxis, reproduction, and elimination and dispersal loops. At each chemotaxis loop, each bacterium in a population finds a new solution through successively carrying out two operators chemotaxis and conjugation and makes a choice between the current solution and the new solution by a greedy selection strategy. In other words, if the modularity score of the new solution is higher than that of the current one, the bacterium will choose the new solution; otherwise, it will keep the current one. After \(N_{c}\) chemotaxis loops, the bacterial population performs a reproduction loop. This process selects \(S_{r}\) bacteria with higher modularity scores and splits each of them into two bacteria while rejecting the other \(S_{r}\) bacteria with lower modularity scores. When \(N_\mathrm{re}\) reproduction loops have been carried out, the bacterial population starts an elimination and dispersal loop. For each bacterium, if a number randomly generated between 0 and 1 is less than the given probability \(P_\mathrm{ed}\), it will be eliminated and re-initialized. After the triple loop, two post-processing operators are performed to further refine the obtained best solution. In essence, chemotaxis, conjugation, and elimination and dispersal have responsibilities to keep the balance between exploitation and exploration. BFO-FMD performs exploitation processes on each candidate solution using chemotaxis and conjugation mechanisms, and exploration processes using the elimination and dispersal mechanism. Reproduction picks the elite individuals with higher modularity scores, which can help to accelerate the convergence speed.

3 Experimental evaluation

In this section, we will present our extensive experimental results to show the performance of the proposed BFO-FMD algorithm. The experimental platform is a PC with Inter(R) Core(TM) i5-3470 CPU 3.20 GHz, 4 GB RAM, and Windows 7, and BFO-FMD is implemented using the Java language.

3.1 Datasets

Three new publicly available yeast PPI datasets including DIP ScereCR20150101, DIP Scere20150101, and MIPS datasets are used in the experiments. Table 1 shows a summary of the datasets, where the 2nd column provides the web links, the 3rd and 4th columns are the number of proteins (P) and interactions (I) in source data, and the 5th and 6th columns give the number of proteins and interactions after deleting all the self-connected and repeated interactions. To evaluate the functional modules discovered by a computational method, the set of real functional modules from Friedel et al. (2008) is selected as the benchmark. This benchmark contains 428 gold standard functional modules and is constructed from three main sources: MIPS (Mewes et al. 2004; Aloy et al. 2004) and the SGD database (Dwight et al. 2002) based on the GO notations.

3.2 Evaluation metrics

Three types of popular evaluation metrics are employed to evaluate the quality of the detected modules (Li et al. 2010).

3.2.1 Precision, Recall, and F-measure

Precision, Recall, and F-measure are three common evaluation metrics in information retrieval and machine learning. Using this type of metric method, it is necessary to define how well a detected module \(h = (V_{h}, E_{h})\) matches a benchmark standard module \(s = (V_{s}, E_{s})\). Many researches use a neighborhood affinity score (NA) to assess the matching degree, which is given as follows:

$$\begin{aligned} \hbox {NA}(h,s) = \frac{|V_{h} \bigcap V_{s}|^{2}}{|V_{h}| \times |V_{s}|}. \end{aligned}$$
(9)

If \(\hbox {NA}(h, s) \ge \omega \), two modules h and s are considered to be matched (generally, \(\omega = 0.2\)). Let H be the set of detected modules and S be the set of gold standard functional modules. The number of the detected modules in H which at least matches one standard module in S is denoted by \(N_\mathrm{ch} = |\{h | h \in H, \exists s \in S, \hbox {NA}(h,s) \ge \omega \}|\), while the number of the standard modules in S which at least matches one detected module in H is indicated by \(N_\mathrm{cs} = |\{s | s \in S, \exists h \in H, \hbox {NA}(h, s) \ge \omega \}|\). Thus, Precision and Recall are given below:

$$\begin{aligned} \hbox {Precision} = \frac{N_\mathrm{ch}}{|H|}, \end{aligned}$$
(10)

and

$$\begin{aligned} \hbox {Recall} = \frac{N_\mathrm{cs}}{|S|}. \end{aligned}$$
(11)

F-measure is a harmonic mean of Precision and Recall. So it can be used to evaluate the overall performance and is expressed as:

$$\begin{aligned} F\hbox {-measure} = \frac{2 \times \hbox {Precision} \times \hbox {Recall}}{\hbox {Precision} + \hbox {Recall}}. \end{aligned}$$
(12)

According to the above definitions, the values of Precision and Recall are actually ratios and range from 0 to 1. For Precision, the closer it is to 1, the more the detected modules which successfully match at least one standard module, and the better the algorithm. When the value of Precision is 1, all the detected modules can match at least one standard module. For Recall, the closer it is to 1, the more the standard modules which successfully match at least one detected module, and the better the algorithm. When the value of Recall is 1, all the standard modules can match at least one detected module. As an overall indicator of Precision and Recall, F-measure also takes values in the range of 0–1, and the closer it is to 1, the better the algorithm.

Table 2 Results of various algorithms on different datasets

3.2.2 Sensitivity, positive predictive value, and accuracy

Sensitivity (Sn), positive predictive Value (PPV), and accuracy (Acc) are three usual measures to evaluate the performance of a detection method. Let \(T_{ij}\) be the number of the common proteins in ith standard module and jth detected module, then Sn and PPV are defined as:

$$\begin{aligned} \hbox {Sn} = \frac{\sum \nolimits _{i=1}^{|S|}\max \nolimits _{j}\{T_{ij}\}}{\sum \nolimits _{i=1}^{|S|} N_{i}}, \end{aligned}$$
(13)

and

$$\begin{aligned} \hbox {PPV} = \frac{\sum \nolimits _{j=1}^{|H|}\max \nolimits _{i}\{T_{ij}\}}{\sum \nolimits _{j=1}^{|H|}T_{\cdot j}}, \end{aligned}$$
(14)

where \(N_{i}\) is the number of the proteins in the ith standard module, and \(T_{\cdot j} = \sum \nolimits _{i}^{|S|}T_{ij}\).

Acc is a comprehensive metric and is defined as the geometric mean of Sn and PPV:

$$\begin{aligned} \hbox {Acc} = (\hbox {Sn} \times \hbox {PPV})^{1/2}. \end{aligned}$$
(15)

From the above definitions, the values of Sn and PPV are also ratios and range from 0 to 1. For Sn, the closer it is to 1, the detected modules give the better coverage of the standard modules, and the better the algorithm. When the value of Sn is 1, the detected modules can completely cover the standard modules. For PPV, the closer it is to 1, the detected modules are more likely to be the standard modules, and the better the algorithm. When the value of PPV is 1, the detected modules are the same as the standard modules, and the better the algorithm. As an overall indicator of Sn and PPV, Acc also takes values in the range of 0–1, and the closer it is to 1, the better the algorithm.

3.2.3 P value measure

P value, known as a metric of functional homogeneity, is usually used to substantiate the biological significance of the detected modules. It is the probability of co-occurrence of proteins in a detected module with a common function and is expressed as follows:

$$\begin{aligned} p = 1 - \sum \limits _{i=0}^{k-1}\frac{\left( \begin{array}{l} |F| \\ i \end{array}\right) \left( \begin{array}{l} |V|-|F|\\ |h|-i \end{array}\right) }{\left( \begin{array}{l} |V| \\ |h| \end{array}\right) }, \end{aligned}$$
(16)

where |V| is the number of all the proteins in a PPI network, |h| is the number of proteins in a detected module h, |F| is the number of proteins in a reference function F, and k represents the number of common proteins in the reference function F and the detected module h. Low p value of a detected module generally means that this module is not merely enriched by proteins from the same function by chance, and thus, this module has high statistical significance and is likely to be a real protein module. In general, a detected module with p value \({<}0.01\) is considered to be statistically significant.

3.3 Comparative evaluations

To demonstrate the performance of the proposed BFO-FMD algorithm, we compared it with six competitive methods: NACO-FMD (Ji et al. 2012a), ACC-FMD (Ji et al. 2015), COACH (Wu et al. 2009), Jerarca (Aldecoa and Marín 2010), CFinder (Adamcsek et al. 2006), and MCODE (Bader and Hogue 2003) on three yeast datasets in Table 1. Among them, NACO-FMD and ACC-FMD are two swarm intelligence-based algorithms introduced before. COACH is a core attachment-based algorithm, which first detects protein cores and then includes attachments into these cores to form biologically meaningful structures. Jerarca belongs to hierarchy-based algorithm, and it can provide optimal partitions of the trees using statistical criteria based on the distribution of intra-cluster and inter-cluster connections. CFinder and MCODE are two density-based algorithms. CFinder first identifies the k-cliques using clique percolation and then combines the adjacent k-cliques to get the functional modules. According to the node’s neighbor local density, MCODE fist picks out seed nodes for initial clusters and then further augments these clusters to form the final clusters. In the experiments, NACO-FMD adopted the same parameters as Ji et al. (2012a), and ACC-FMD employed the same parameters as Ji et al. (2015). For methods COACH, Jerarca, CFinder, and MCODE, we obtained their software implementations and used the default values for their parameters in the experiments.

As for our BFO-FMD algorithm, there are several parameters. In the following, we will describe their roles. S is the size of bacterial population and determines number of individuals to search better module partitions. \(N_{c}\) is the maximum number of chemotaxis and determines the number of times searching around a candidate solution. \(N_{s}\) is the maximum number of swimming during each chemotaxis and determines the number of steps that an individual swims toward a specific direction. \(N_\mathrm{re}\) is the maximum number of reproduction and determines the number of times that the algorithm selects superior solutions. \(N_\mathrm{ed}\) is the maximum number of elimination and dispersal, which determines the number of times that the bacterial population explores some other regions of the search space. \(P_\mathrm{ed}\) is the probability of elimination and dispersal and determines the number of individuals dispersed into some other regions to search better module partitions. \(P_{co}\) is the probability of conjugation and determines the number of individuals searching for better module partitions under the help of other individuals. \(P_\mathrm{ch}\) is the probability used in generating a masking vector and determines the strength of local search around a candidate solution in the chemotaxis process. L is the conjugation length and determines the strength of local search around a candidate solution in the conjugation process. \(\varepsilon \) is the similarity threshold between nodes and determines the quality of initial solutions. \(\lambda \) is the merging threshold, and \(\delta \) is the filtering threshold, and they determine the strength of post-processing operators. For these parameters, too big or too small values have serious effects on the solution quality and the runtime efficiency. To select a set of relatively good values, we have done a series of hand-tuning experiments and took \(S=100\), \(N_{c}=100\), \(N_{s}=4\), \(N_\mathrm{re}=4\), \(N_\mathrm{ed} = 2\), \(P_\mathrm{ed} = P_{co}=0.2\), \(P_\mathrm{ch}=0.05\), \(L=0.05 * |V|\), \(\varepsilon = 0.4\), \(\lambda = 0.2\), and \(\delta = 0.05\).

Fig. 4
figure 4

Comparative results of seven methods in terms of various evaluation metrics on ScereCR20150101 dataset

Table 2 provides the basic information of the detection results for the seven algorithms on the three datasets. For each algorithm, we have listed the number of detected modules (Number of modules), the average number of proteins in each module (Average size of modules), the number of detected modules which match at least one gold standard module (\(N_\mathrm{ch}\)) and the number of gold standard modules that match at least one detected module (\(N_\mathrm{cs}\)). Taking BFO-FMD on ScereCR20150101 dataset as an example, it has detected 324 modules, of which 158 match 244 gold standard modules, and the average size of the 324 detected modules is about five proteins. From the table, MCODE always obtains the least number of modules (number of modules), ACC-FMD gets the biggest module (average size of modules), and BFO-FMD usually generates more and smaller modules and the number of gold standard modules that match at least one detected module (\(N_\mathrm{cs}\)) is always much more than all the other algorithms on the three datasets.

Fig. 5
figure 5

Comparative results of seven methods in terms of various evaluation metrics on Scere20150101 dataset

Table 3 Distribution comparisons of the p values of protein modules obtained by different algorithms on ScereCR20150101

Figures 4, 5, and 6 show the overall comparison results of seven algorithms on the three datasets in terms of various evaluation metrics including Precision, Recall, F-measure, Sensitivity, PPV, and Accuracy. On the core dateset of yeast (ScereCR20150101) in Fig. 4, one easily see that our BFO-FMD algorithm achieves the best performance on five out of six metrics except Precision statistic. In detail, ACC-FMD and MCODE take the first and second places in term of Precision, respectively. However, ACC-FMD usually mines high overlapping modules and many of which are very similar, and MCODE detects very few protein modules (only 102 in Table 2), which leads to higher Precision values. The Precision of our algorithm is 0.487, which is almost the same as those of COACH (0.493) and CFinder (0.506), and higher than those of the remaining two methods NACO-FMD (0.355) and Jerarca (0.371). BFO-FMD algorithm matches the most real protein modules and obtains the highest Recall (0.57), which is much better than all the other six algorithms. On the whole, due to the balance of Precision and Recall, BFO-FMD algorithm also attains the highest F-measure (0.526). As for another three metrics of Sensitivity, PPV, and Accuracy, our algorithm still plays the best among the seven algorithms. Its corresponding values are 0.381, 0.357, 0.359 in terms of Sensitivity, PPV, Accuracy, respectively, which are a little better than those of NACO-FMD (0.378, 0.341, and 0.359) and much better than the other five algorithms.

Fig. 6
figure 6

Comparative results of seven methods in terms of various evaluation metrics on MIPS dataset

Table 4 Some functional modules predicted by BFO-FMD using ScereCR20150101 data
Table 5 Some functional modules predicted by NACO-FMD using ScereCR20150101 dataset
Table 6 Some functional modules predicted by ACC-FMD using ScereCR20150101 dataset
Table 7 Some functional modules predicted by COACH using ScereCR20150101 dataset
Table 8 Some functional modules predicted by Jerarca using ScereCR20150101 dataset
Table 9 Some functional modules predicted by CFinder using ScereCR20150101 dataset
Table 10 Some functional modules predicted by MCODE using ScereCR20150101 dataset

Figure 5 shows that our BFO-FMD algorithm performs extremely well on larger Scere20150101 dataset, while other algorithms suffer performance degradation compared to their results on smaller cereCR20150101 dataset. More specifically, BFO-FMD algorithm has overwhelming superiority in terms of five metrics: Precision, Recall, F-measure, PPV, and Accuracy. Only on Sensitivity statistic, CFinder achieves better than our BFO-FMD algorithm. It is actually because this algorithm obtained an impossibly huge module including 1858 proteins, so the great majority of proteins in benchmark set are covered by this very big module, which results in a high Sensitivity value for CFinder according to the calculation formula in Eq. (13).

For another larger MIPS dataset shown in Fig. 6, the proposed BFO-FMD algorithm still yields good results, even though it does not achieve as well as it does on Scere20150101. BFO-FMD attains the best performance in terms of Recall, F-measure, and PPV. On Precision statistic, BFO-FMD is only worse than MCODE. However, we do not think we should give MCODE a highest ranking on this statistic, since it predicted very few modules (65 in Table 2). The sensitivity value of our algorithm is 0.285, which is in third place out of the seven algorithms and is slightly lower than that of the second place CFinder (0.31). There is a relatively large gap between BFO-FMD and the first place NACO-FMD on this statistic. So BFO-FMD obtains the next best performance after NACO-FMD in term of Accuracy which is a comprehensive metric of Sensitivity and PPV, despite the fact that our algorithm achieves the best performance on PPV statistic.

The above outstanding experimental results of BFO-FMD on the three datasets fully demonstrate that BFO-FMD not only performs perfectly on smaller dataset (ScereCR20150101), but also plays very well on some relatively larger datasets (Scere20-150101 and MIPS). Therefore, the proposed BFO-FMD algorithm is a robust algorithm whose superior performance is not dependent on the underlying data.

Fig. 7
figure 7

Anaphase-promoting module detected by various algorithms: a Benchmark, b BFO-FMD, c ACC-FMD, d COACH, e Jerarca, f MCODE

Table 3 illustrates the relative performance of various algorithms in terms of p values, where the first column gives three types of p values, the second column lists seven algorithms, the third to sixth columns present the number of statistically significant modules located in the corresponding range. The last column is the proportion of significant modules (p value \({<}0.01\)). In this table, we notice that MCODE and ACC-FMD get higher ratio than other methods. It is mainly because MCODE obtains the minimum amount of modules, and ACC-FMD is designed to search high overlapping modules from the PPI networks, which leads to their high ratios. As for other five algorithms, there is no major difference on the ratio, and our BFO-FMD method ranks in the middle. It is worth noting that most modules detected by various algorithms stand in ranges (1.0e−20, 1.0e−10] and (1.0e−10, 0.01], while only a few modules fall into the intervals (0, 1.0e−30] and (1.0e−30, 1.0e−20], where BFO-FMD has obvious advantages comparing with other algorithms except ACC-FMD. This result illustrates that the proposed BFO-FMD method is able to detect more modules with very high statistical significance.

To further investigate the computational results, 10 modules with low p values and high matching degree mined by different methods on ScereCR20150101 dataset are presented in Tables 4, 5, 6, 7, 8, 9, and 10. In these tables, the first column is a module identifier. The second and third columns give the number of proteins, and the proteins contained in each module, respectively. The fourth column provides the corresponding gold standard protein module. The fifth column refers to the matching degree measured by the neighborhood affinity score (NA) between a predicted module and a gold standard module. The last three columns list three types of p values of predicted modules from the view of biological process, cellular component, and molecular function. According to the NA metric in the fifth column, it is known that many modules detected by the seven methods match well with the benchmark modules. All the p values in these tables are very low, which perfectly demonstrates the modules detected by these computational methods have high statistical significance from three different GO categories.

To explicitly reveal the results obtained by our BFO-FMD algorithm, we take the anaphase-promoting module as an example to explain. Tables 4, 5, 6, 7, 8, 9 and 10 show that five algorithms BFO-FMD, ACC-FMD, COACH, Jerarca, and MCODE have detected the anaphase-promoting module (the 3rd module in Tables 4, 5, 6, 7, 8, 9, and 10). Moreover, according to the NA statistic, BFO-FMD obtains the highest matching degree between the predicted anaphase-promoting module and the standard one among the five algorithms. Figure 7 illustrates the module structures for the gold standard anaphase-promoting module and the detected ones for the five algorithms. Figure 7a shows the gold standard anaphase module, which contains 16 proteins, and one of them (ygl116w) is isolated by other proteins in the same module. The anaphase-promoting module detected by our BFO-FMD algorithm is given in Fig. 7b. This module consists of 14 proteins and succeeds in matching all the 14 proteins, which are more than those of other algorithms, and only one protein is missing in contrast with the standard one except the isolated protein (ygl116w). The other four algorithms ACC-FMD, COACH, Jerarca, and MCODE only cover 13, 11, 13, 10 proteins in standard anaphase-promoting module, as shown in Fig. 7c–f, respectively. This example intuitively demonstrates that the proposed BFO-FMD algorithm is able to more accurately detect functional modules.

Table 11 Some new functional modules predicted by BFO-FMD using ScereCR20150101 dataset

In addition, the proposed BFO-FMD algorithm has the potential to discover some new modules with biological significance from the statistical point of view. Table 11 lists five new modules with very low p values on the ScereCR20150101 dataset. These five modules can be inquired on the Web site, but are not described in the set of benchmark modules, and the 1st, 3rd, and 5th modules are not found by any of other six algorithms. This result suggests that BFO-FMD has more powerful exploratory ability to detect functional modules from a PPI network and can help biologists predict some new protein modules to a certain extent.

In short, the above experimental results and analysis have justified the effectiveness of the proposed BFO-FMD algorithm. Specifically, BFO-FMD is not only better than some famous non-swarm intelligence-based algorithms (COACH, Jerarca, CFinder, and MCODE), but is also superior to some new developed swarm intelligence-based algorithms (NACO-FMD and ACC-FMD). It is because that BFO-FMD has four biological mechanisms (chemotaxis, conjugation, reproduction, and elimination and dispersal) designed appropriately and obtains good ability of balancing global search and local search.

4 Conclusions

The identification of functional modules in a PPI network is important for biological knowledge discovery since many important biological processes in the cell are carried out through the formation of protein modules. Nowadays, the computational approaches based on swarm intelligence have been a kind of effective ways for functional module detection due to their good robustness performance. In this paper, a new swarm intelligence algorithm based on bacterial foraging optimization was proposed for detecting protein modules in a PPI network (called as BFO-FMD). BFO-FMD first combines the topology and function information between protein nodes to construct a candidate solution for each bacterial individual according to a random-walk behavior. Then, it accomplishes information exchange and optimizes each candidate solution by four biological mechanisms: chemotaxis, conjugation, reproduction, and elimination and dispersal. At last, two post-processing operators are performed to refine the detected result in light of function similarity and topological structure. To demonstrate the performance of BFO-FMD, we have carried out a series of experiments on three common yeast datasets in terms of various evaluation metrics. The empirical results illustrate that BFO-FMD achieves prominent Recall, F-measure, and PPV while performing very well in terms of other metrics, such as Precision, Sensitivity, Accuracy, and P values. These results indicate that the proposed BFO-FMD algorithm is able to detect effectively protein modules and can be considered as a complement to help biologists to get some novel biological insights.