1 Introduction

Bioinformatics has been a rapidly growing field in recent years. Certain biological problems can be modelled using networks, most notably gene regulatory networks [1] and protein-protein interaction (PPI) networks [2]. Solutions to network problems, which are relatively well studied in computer science, are often regarded as valuable by biologists [3].

In this paper, we present a unified study of combinatorial optimisation problems in the analysis of PPI and protein folding (PF) networks. The aim is to explore the intersection of two areas of applied evolutionary computation and computational intelligence in general. On the one hand, the study belongs to computational intelligence in bioinformatics; on the other hand, it explores biological networks using the methodologies of evolutionary computation and heuristics for combinatorial optimisation.

Contributions. Using a combination of classical and randomised search heuristics, we obtain high-quality solutions to some of the well-known combinatorial optimisation problems in PPI and PF networks, which are known to be NP-hard in general [4, 5].

Experimental results are presented for networks of the European COSIN project [6]. For four different PPI networks, we obtain optimal solutions to the maximum independent set and minimum vertex clique covering problems. We used a combination of a greedy approximation algorithm for maximum independent set in sparse graphs [7] and a hybrid of iterated greedy (IG) clique covering and randomised local search (RLS) for maximum independent set [8]. For three of the four PPI networks, we obtain optimal solutions to the maximum clique and chromatic number problems using a hybrid of Brélaz’s heuristic [9] and an iterated greedy graph colouring algorithm [10]. To explore the minimum dominating set problem, we use a classical greedy approximation algorithm [11].

In addition, we apply the same techniques to a PF network, which is considerably larger than the PPI networks. A reduced variant of this PF network is explored, too. We find that the PF network has slightly different properties from the PPI networks, which is probably related to both its size and its structure. Nevertheless, we also obtained a very small gap between the bounds for the maximum clique size and the chromatic number of this network.

The paper is organised as follows. In Sect. 2, we present an overview of the topic from several relevant perspectives. In Sect. 3, we present our approach to the study of PPI and PF networks. In Sect. 4, we present the experimental results and their possible applications. Finally, in Sect. 5, we formulate conclusions and summarise the problems that remain open.

2 Combinatorial Optimisation Problems in Protein-Protein Interaction and Protein Folding Networks

There is a body of work concentrating on computer-scientific aspects of study of biological networks. In this section, we present an overview of relevant research and perspectives on our topic.

Protein-protein interaction (PPI) networks. Vertices of a PPI network represent proteins and edges represent interactions between them. These networks are constructed by molecular biologists, usually as an outcome of two-hybrid screening experiments [3]. Analysis and comparison of PPI networks are common research topics [12], along with the development of analytical software for biological networks [13]. In our experiments, we study public domain PPI network data of the European COSIN project [6]. These include PPI networks for the bacterium Escherichia coli, commonly found in the gastrointestinal tract; the nematode worm Caenorhabditis elegans; Helicobacter pylori, a bacterium associated with gastritis and usually found in the upper gastrointestinal tract; and Saccharomyces cerevisiae, a commonly used species of yeast. PPI network data for yeast are a common subject of study [14, 15].

Clustering of PPI networks. Probably the best-known topic in computer-scientific research on PPI networks is the clustering of these networks, i.e. their decomposition into relatively dense subgraphs. In PPI networks, this is motivated by the problems of complex and functional module detection, which aim to identify groups of mutually interacting proteins that are likely to be involved in the same biological processes [16, 17].

It is worth noting that biologists tend to distinguish between the terms “complex” and “module”. A complex in a PPI network refers to a molecular machine of proteins that bind to each other at the same time and place, while a module refers to a group of mutually interacting proteins that control a certain cellular function, without taking the spatial and temporal aspects into account [18]. However, experimentally obtained PPI data often do not incorporate this information in the network. PPI network data are valuable in the reconstruction of metabolic and signalling pathways [3], the understanding of cell regulation, the prediction of the role of uncharacterised proteins and possible therapy [18]. Multifunctional proteins have previously been revealed [19], so the discovery of overlapping modules is a relevant topic for PPI networks, too [20]. One way to detect them is clique merging [21].

Clustering of PPI networks has many similarities with the detection of community structure in social networks [22]. Both areas suffer from the existence of a large number of diverse clustering algorithms, using ideas ranging from information flow simulation [23], spectral properties of adjacency matrices [24, 25] and cost-based clustering [26] to stochastic optimisation techniques [18]. Moreover, the quality of a clustering algorithm can be evaluated using a wide spectrum of metrics, and multiple objective functions can be considered [27]. Both clustering quality and applicability of the developed methods to large networks seem to be important [28]. It can be observed that different clustering algorithms may output very different clusters, each having a different desirable property of a dense or well-separated network substructure [29]. Therefore, multiobjective optimisation has been successfully applied to network community detection [30]. However, assessing the quality of a clustering of a biological network [31] remains hard and often requires falling back on a reference solution [18, 30] or simply requesting verification from a biologist. Additionally, clustering or partitioning of a network [32] often leads to NP-hard combinatorial optimisation problems [33], which generally require specific attention [4, 5].

Protein folding (PF) network beta3s. This network represents the conformation space of a 20-residue antiparallel \(\beta \)-sheet peptide investigated by NMR spectroscopy. Vertices represent conformations and edges represent transitions. The network seems to represent a complex system, in which spontaneous folding of the protein is modelled as a (weighted) random walk on the conformation space network. Owing to space constraints and the methods used, we only consider the structure of the network and omit the weights [34].

PPI and PF networks have also been previously studied in the context of centrality metrics and their stability and potential decomposition [35]. Enumerative and spectral analytical methodologies were also used to study their structure [24]. Statistical analysis of complex networks helps in understanding of the large-scale properties of these networks, too [36].

Combinatorial optimisation problems in networks. We investigate five different classical NP-hard combinatorial optimisation problems [4, 5]. For simplicity, we describe these problems only informally.

Maximum clique is the largest subgraph, in which each pair of vertices is adjacent. In the context of PPI networks, it is the largest group of proteins, in which all proteins mutually interact. Maximum clique size is denoted by \(\omega \). There is a spectrum of algorithms for this problem [37].

Graph colouring is an assignment of colours to vertices such that, for each edge, its two vertices are coloured differently. The minimum number of colours needed to obtain a graph colouring is called the chromatic number and is denoted by \(\chi \). The chromatic number is useful, since for each graph it holds that \(\omega \le \chi \) [38], i.e. maximum clique size and chromatic number represent bounds for each other. Randomised algorithms are frequently used to solve graph colouring, too [39].
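The way the two quantities bound each other can be illustrated by a short sketch. Any clique of size k certifies \(k \le \omega \), and any proper colouring with q colours certifies \(\chi \le q\), so that \(k \le \omega \le \chi \le q\). The function names and the dictionary-of-sets graph representation below are assumptions made for illustration only; the 5-cycle is a standard example with \(\omega = 2\) and \(\chi = 3\), which shows that the two bounds need not meet.

```python
def is_clique(adj, vertices):
    """Return True if every pair of the given vertices is adjacent."""
    vs = list(vertices)
    return all(v in adj[u] for i, u in enumerate(vs) for v in vs[i + 1:])


def is_proper_colouring(adj, colour):
    """Return True if the endpoints of every edge have different colours."""
    return all(colour[u] != colour[v] for u in adj for v in adj[u])


# Illustrative instance (an assumption, not data from the study): a 5-cycle.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
clique = {0, 1}                              # certifies omega >= 2
colouring = {0: 0, 1: 1, 2: 0, 3: 1, 4: 2}   # certifies chi <= 3

lower = len(clique)
upper = len(set(colouring.values()))
print(f"{lower} <= omega <= chi <= {upper}")
```

When the lower and upper values coincide, both the clique and the colouring are proven optimal; otherwise only an interval is obtained.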

Maximum independent set is the largest subgraph, in which no pair of vertices is adjacent. Maximum independent set size is denoted by \(\alpha \). In a PPI network, a maximum independent set is a largest set of mutually non-interacting proteins.

Minimum vertex clique covering is a partitioning of the network into as few non-overlapping cliques as possible. In PPI networks, it represents a problem of finding the minimum number of clusters such that within each cluster, all proteins must be mutually interacting. The number of cliques in a minimum vertex clique covering is denoted by \(\vartheta \). Similarly to maximum clique and graph colouring, it holds that \(\alpha \le \vartheta \) [8]. Hence, maximum independent set and minimum vertex clique covering represent bounds for each other, too.

The last studied problem is the minimum dominating set problem. Minimum dominating set is the smallest subset of vertices such that each vertex is either in the dominating set or has a neighbour in it. Minimum dominating set size is denoted by \(\gamma \). For PPI networks, dominating set represents a set of “central” proteins such that all other proteins interact with at least one protein of the dominating set.

3 Our Experimental Approach

[Algorithm 1: pseudocode of the experimental approach]

Graph-theoretical approaches represent a vital part of the tools used to analyse biological networks [43]. We aim to provide an approach for their exploration, which ensures solid generalisation and computes properties, which are naturally related to functional module identification. Indeed, large cliques, independent sets and dominating sets represent such properties. Additionally, these problems have clear definitions and approaches, which can easily be applied to previously unexplored PPI or possibly other biological networks. The aim is to provide a hybrid technique, providing bounds for several well-defined valuable properties of an unknown network, which lead to NP-hard combinatorial optimisation problems.

This way, we are able to characterise the structure of the networks using cliques, independence and domination and avoid the broad notion of general clustering.

To carry out our investigations, we use a collection of classical heuristics, as well as order-based stochastic algorithms to find high-quality solutions to our combinatorial optimisation problems. The main process of mining from the network data is characterised by the pseudocode of Algorithm 1.

Let us now describe the steps in a slightly more detailed way. Due to lack of space, we are not able to review all aspects of the algorithms we used. However, an interested reader may refer to the referenced work.

In steps 1–7, we use a simple greedy clique algorithm. It starts with an empty clique and orders the vertices from the largest degree to the smallest. It adds the current vertex to the clique if and only if the clique property is not violated by adding it. In fact, this approach is equivalent to using the greedy algorithm for independent set [7] on the complement of our graph.
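The greedy clique construction described above can be sketched as follows. This is a minimal illustration, assuming a graph stored as a dictionary mapping each vertex to the set of its neighbours; the function name and the tie-breaking among vertices of equal degree (input order) are assumptions, since the text does not specify them.

```python
def greedy_clique(adj, order=None):
    """Scan vertices in the given order (default: non-increasing degree)
    and keep each vertex that is adjacent to all vertices kept so far."""
    if order is None:
        # Stable sort: ties between equal degrees keep the input order.
        order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    clique = []
    for v in order:
        if all(u in adj[v] for u in clique):
            clique.append(v)
    return clique


# Illustrative graph (hypothetical): a triangle 0-1-2 with two pendant
# vertices 3 and 4 attached to vertex 0.
graph = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
print(greedy_clique(graph))
```

The optional `order` argument allows the same routine to be reused with an arbitrary permutation instead of the degree ordering.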

In step 8, we use Brélaz’s heuristic implemented with binary heap to find a colouring of the network in \(\mathcal {O}(m \log n)\) time, where n is the number of vertices and m is the number of edges.
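A heap-based variant of Brélaz’s heuristic (often called DSATUR) can be sketched as follows. The sketch is an assumption about one reasonable implementation, not the implementation used in the study: it uses a lazy-deletion binary heap, breaks ties by the static vertex degree rather than by the degree in the uncoloured subgraph, and requires vertex labels to be mutually comparable.

```python
import heapq


def dsatur_colouring(adj):
    """Repeatedly colour the uncoloured vertex whose neighbours already use
    the most distinct colours, giving it the smallest free colour."""
    colour = {}
    neighbour_colours = {v: set() for v in adj}
    # heapq is a min-heap, so saturation and degree are stored negated.
    heap = [(0, -len(adj[v]), v) for v in adj]
    heapq.heapify(heap)
    while heap:
        neg_saturation, _, v = heapq.heappop(heap)
        if v in colour or -neg_saturation != len(neighbour_colours[v]):
            continue  # stale entry: vertex coloured or saturation outdated
        c = 0
        while c in neighbour_colours[v]:
            c += 1
        colour[v] = c
        for u in adj[v]:
            if u not in colour and c not in neighbour_colours[u]:
                neighbour_colours[u].add(c)
                entry = (-len(neighbour_colours[u]), -len(adj[u]), u)
                heapq.heappush(heap, entry)
    return colour


# Illustrative instance (hypothetical): a 5-cycle.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
colouring = dsatur_colouring(cycle5)
print(len(set(colouring.values())), "colours used")
```

Each edge triggers at most one heap insertion per endpoint, so the number of heap operations is linear in n + m, which is consistent with a logarithmic factor per operation.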

If the size of the clique from steps 1–7 and the number of colours used in step 8 are not equal, we use the iterated greedy (IG) graph colouring search heuristic [10, 42], combined with randomised local search (RLS) for maximum clique. This is represented by steps 9–10 of Algorithm 1. We start with the clique and colouring found in steps 1–8. IG uses randomised block-based moves to possibly reduce the number of colours. RLS for maximum clique has not previously been used. Therefore, we describe it in more detail.

RLS for maximum clique uses the same algorithm for clique construction as in steps 1–7. However, it works with a predefined permutation instead of ordering the vertices by their degrees. In the beginning, the vertices of clique Q are placed first in the permutation and the other vertices follow in random order. In each time step of RLS, a jump move is attempted. The jump move takes a uniformly random vertex from the permutation and moves it to the first position. The other vertices are shifted to the right. The resulting permutation is used to construct a new clique and is accepted if the new clique is at least as large as the current one.
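The RLS procedure with jump moves described above can be sketched as follows. This is an illustrative reading of the description, with hypothetical function names, a fixed iteration budget and a seeded random generator added for reproducibility; the actual implementation interleaves RLS with IG and uses a different stopping criterion.

```python
import random


def clique_from_permutation(adj, perm):
    """Greedy clique construction that scans vertices in permutation order."""
    clique = []
    for v in perm:
        if all(u in adj[v] for u in clique):
            clique.append(v)
    return clique


def rls_max_clique(adj, start_clique, iterations, seed=0):
    """Randomised local search over permutations using jump moves."""
    rng = random.Random(seed)
    in_start = set(start_clique)
    rest = [v for v in adj if v not in in_start]
    rng.shuffle(rest)
    perm = list(start_clique) + rest      # clique vertices come first
    best = clique_from_permutation(adj, perm)
    for _ in range(iterations):
        i = rng.randrange(len(perm))
        # Jump move: the chosen vertex goes first, the others shift right.
        candidate = [perm[i]] + perm[:i] + perm[i + 1:]
        new_clique = clique_from_permutation(adj, candidate)
        if len(new_clique) >= len(best):  # accept equal or better
            perm, best = candidate, new_clique
    return best


# Illustrative graph (hypothetical): a hub 0 with five leaves, plus a
# separate 4-clique on vertices 1..4 that the degree ordering misses.
graph = {0: {5, 6, 7, 8, 9}, 5: {0}, 6: {0}, 7: {0}, 8: {0}, 9: {0},
         1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
print(sorted(rls_max_clique(graph, start_clique=[0, 5], iterations=200)))
```

Accepting permutations that yield an equally large clique lets the search drift across plateaus instead of stalling at the first local optimum.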

In step 11, we use the greedy approximation algorithm for maximum independent set in sparse and bounded degree graphs [7]. We use binary heap as a priority queue.
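A heap-based greedy algorithm for independent set can be sketched as follows. The sketch assumes the common minimum-degree rule (repeatedly select a vertex of minimum remaining degree and delete it together with its neighbours); the text does not spell out the rule, so this choice, the function name and the lazy-deletion heap are assumptions. Neighbourhoods must be stored as sets and vertex labels must be mutually comparable.

```python
import heapq


def greedy_independent_set(adj):
    """Minimum-degree greedy: pick a vertex of smallest remaining degree,
    then remove it and all its remaining neighbours from the graph."""
    degree = {v: len(adj[v]) for v in adj}
    alive = set(adj)
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    independent = []
    while heap:
        d, v = heapq.heappop(heap)
        if v not in alive or d != degree[v]:
            continue  # stale entry: vertex removed or degree outdated
        independent.append(v)
        removed = {v} | (adj[v] & alive)
        alive -= removed
        for u in removed:
            for w in adj[u]:
                if w in alive:
                    degree[w] -= 1
                    heapq.heappush(heap, (degree[w], w))
    return independent


# Illustrative graph (hypothetical): a path 0-1-2-3-4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(sorted(greedy_independent_set(path)))
```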

In step 12, we apply the recently proposed IG heuristic for minimum vertex clique covering with RLS for maximum independent set [8].

In step 13, the greedy approximation algorithm for dominating set is used to compute an upper bound for minimum dominating set size [11]. Additionally, a lower bound for the size of minimum dominating set is computed in steps 14–15. This lower bound represents a maximum of three different lower bounds. One of them is the number of components c, the second bound is a general bound derived from maximum degree \(\varDelta \) and the third bound is implied by logarithmic approximation guarantee of the greedy algorithm.
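The bounds for the minimum dominating set can be sketched as follows. The greedy routine selects the vertex that dominates the most not-yet-dominated vertices. The exact form of the second and third lower bounds is not given in the text, so the sketch assumes the standard forms \(\gamma \ge \lceil n/(\varDelta + 1) \rceil \) and \(\gamma \ge \lceil g / H(\varDelta + 1) \rceil \), where g is the size of the greedy solution and H(k) is the k-th harmonic number from the set-cover analysis of the greedy algorithm. All function names are hypothetical, and the graph is assumed to be non-empty.

```python
import math
from fractions import Fraction


def greedy_dominating_set(adj):
    """Repeatedly add the vertex whose closed neighbourhood covers the
    largest number of vertices that are not yet dominated."""
    undominated = set(adj)
    dominating = []
    while undominated:
        best = max(adj, key=lambda v: len(({v} | adj[v]) & undominated))
        dominating.append(best)
        undominated -= {best} | adj[best]
    return dominating


def count_components(adj):
    """Count connected components with an iterative depth-first search."""
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
    return components


def dominating_set_lower_bound(adj, greedy_size):
    """Maximum of three lower bounds (assumed forms, see the lead-in)."""
    n = len(adj)
    max_degree = max(len(nb) for nb in adj.values())
    degree_bound = math.ceil(Fraction(n, max_degree + 1))
    # Exact rational arithmetic avoids rounding a valid bound upwards.
    harmonic = sum(Fraction(1, k) for k in range(1, max_degree + 2))
    approximation_bound = math.ceil(Fraction(greedy_size) / harmonic)
    return max(count_components(adj), degree_bound, approximation_bound)


# Illustrative graph (hypothetical): two disjoint edges.
two_edges = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
upper = len(greedy_dominating_set(two_edges))
lower = dominating_set_lower_bound(two_edges, upper)
print(lower, "<= gamma <=", upper)
```

When the lower and upper values coincide, as in this small example, the greedy solution is proven optimal.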

Note that our approach is not specifically restricted to PPI and PF networks. It can easily be applied to social networks or other complex network data. However, for the purpose of this study, we focus specifically on its suitability to explore biological network data.

4 Experimental Results and Discussion

We performed the evaluation in two parts. We first used the approach without the stochastic techniques based on IG and RLS (i.e. we omitted steps 10 and 12). Hence, we used only greedy algorithms. To provide an upper bound for \(\vartheta \), we used Brélaz’s heuristic applied to complementary graph \(\overline{G}\). \(\overline{G}\) contains edges between pairs of vertices, which are not adjacent in G and vice versa. In Table 1, we present the best results obtained by this approach in 20 independent runs.
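The relation between colouring \(\overline{G}\) and covering G by cliques can be sketched as follows: each colour class of \(\overline{G}\) is an independent set in \(\overline{G}\) and therefore a clique in G, so the number of colours is an upper bound for \(\vartheta \). For brevity, the sketch uses a simple sequential greedy colouring as a stand-in for Brélaz’s heuristic, and it builds the complement explicitly, which needs quadratic memory and is therefore only suitable for small graphs. The function names are hypothetical.

```python
def complement(adj):
    """Explicit complement graph; quadratic in the number of vertices."""
    vertices = set(adj)
    return {v: vertices - adj[v] - {v} for v in adj}


def greedy_colouring(adj):
    """Sequential greedy colouring in non-increasing degree order
    (a simple stand-in for Brélaz's heuristic)."""
    colour = {}
    for v in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        used = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour


def clique_cover_from_colouring(adj):
    """Colour the complement; each colour class is a clique of the input."""
    colour = greedy_colouring(complement(adj))
    classes = {}
    for v, c in colour.items():
        classes.setdefault(c, []).append(v)
    return list(classes.values())


# Illustrative graph (hypothetical): a triangle 0-1-2 and an edge 3-4.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
cover = clique_cover_from_colouring(graph)
print(len(cover), "cliques:", cover)
```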

For evaluation of the impact of stochastic components of the approach, we then used the full approach, as specified by Algorithm 1. These results are presented in Table 2. Similarly, we performed 20 independent runs for PPI networks and the reduced PF network beta3s.reduced and present the best results obtained. For the large PF network beta3s, we performed only one long run.

The stochastic subroutines of our approach were parameterised as follows. For IG for graph colouring and RLS for maximum clique, we used a simultaneous implementation with 5 iterations of RLS per iteration of IG. Stochastic optimisation was stopped after 100n iterations without improvement of either the clique or the colouring. Similarly, IG for minimum vertex clique covering and RLS for maximum independent set were used in an implementation with 5 iterations of RLS per iteration of IG. The stopping criterion was analogous: optimisation was stopped after 100n iterations without improvement of either the clique covering or the independent set. Interestingly, these stopping criteria led to results of good quality and solid scalability for all four of these problems.

Both Tables 1 and 2 have the following structure. The first column contains the name of the network. Its number of vertices n, number of edges m, number of connected components c and the number of triangles \(\tau \) are specified along with the name. The next columns present the maximum clique size \(\omega \), chromatic number \(\chi \), maximum independent set size \(\alpha \), minimum clique covering size \(\vartheta \) and minimum dominating set size \(\gamma \). If a cell contains only one value, it means that the value is a numerically proven optimum for the particular characteristic. If it contains two values separated by \(-\), it means that the value is located within the interval specified by presented values. Symbol n in the table means that the value is upper bounded only by the number of vertices n. Bold numbers in Table 2 represent values, which were obtained only by the stochastic approach, i.e. randomised search techniques were beneficial for these instances.

Table 1. Experimental results obtained for PPI and PF networks by using only greedy algorithms (i.e. without steps 10 and 12 in Algorithm 1).
Table 2. Experimental results obtained for PPI and PF networks by using the full stochastic approach, including IG and RLS algorithms (i.e. full Algorithm 1, including steps 10 and 12). Bold values represent instances, for which IG and RLS provided improved results compared to purely greedy algorithms.

Additionally, we enumerated the maximal cliques of each network [46]. A clique is maximal if it is not a subgraph of a larger clique. The aim is to compare the number of maximal cliques with the number of maximum (i.e. largest) cliques and to further analyse the cliques as building blocks of the networks.
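The distinction between maximal and maximum cliques can be illustrated by a short enumeration sketch. The sketch uses the classical Bron–Kerbosch algorithm with pivoting, which is a substitute chosen for brevity and not necessarily the enumeration method of [46]. The function name and the graph representation (a dictionary of neighbour sets) are assumptions.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))  # cannot be extended: maximal
            return
        # Pivot on the vertex with the most neighbours among candidates.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates.remove(v)
            excluded.add(v)

    expand(set(), set(adj), set())
    return cliques


# Illustrative graph (hypothetical): a triangle 0-1-2 and an edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
found = maximal_cliques(graph)
largest = max(len(c) for c in found)
print("maximal:", sorted(found))
print("maximum:", [c for c in found if len(c) == largest])
```

In this example both cliques are maximal, but only the triangle is a maximum clique.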

Network ecoli contains a maximum clique of 6 mutually interacting proteins. Using enumeration based on triangles, we found that there are 657 maximal cliques of size at least 3. Of these, 5 cliques consist of 6 proteins. The network ecoli can be partitioned into \(\vartheta = 161\) non-overlapping cliques, with an average clique size of 1.678. There is also a dominating set of 69 proteins such that all other proteins interact with at least one protein of this set.

For network elegans, we have that its maximum clique size is 3 and there are 39 triangles representing maximum cliques. It can be partitioned into 294 non-overlapping cliques. The average size of such a clique is 1.276, which makes it the network with the smallest average clique size in minimum vertex clique covering. This is understandable, since this network is the sparsest. It contains a dominating set consisting of 71 vertices.

For network helico, we obtained a clique of size 3, while we were only able to find a 4-colouring. This is the only PPI network for which we obtained a gap between the estimates of maximum clique size and chromatic number. Using enumeration, we found that there is no 4-clique and that the number of triangles of mutually interacting proteins is 76. This confirms that, while the chromatic number can be used as a good upper bound on the size of the maximum clique of mutually interacting proteins, one cannot guarantee that these two values will be equal for PPI networks. Network helico can be partitioned into 528 non-overlapping cliques of average size 1.386. It also contains a dominating set of size 164.

Instance yeast contains 1872 maximal cliques, which is the largest number of maximal cliques among the studied PPI networks. However, only 12 of them are also maximum cliques, which contain 7 vertices. These will shortly be discussed below. Network yeast can be partitioned into 2673 non-overlapping cliques, which have average size 1.550. Dominating set on 959 vertices for this network is the largest among the PPI networks, too.

Table 3. Listing of 12 maximum cliques of size 7 in yeast PPI network.
Fig. 1. Visualisation of colourings found for protein-protein interaction networks elegans (upper, on the left-hand side), helico (upper, on the right-hand side), ecoli (lower, on the left-hand side) and yeast (lower, on the right-hand side). These colourings represent good upper bounds for the size of maximum clique of mutually interacting proteins for these PPI networks. Based on the availability of protein labels and expected visual quality, labels or indices of vertices are presented for some of the networks and vertices.

It is not surprising that the numbers of maximal and maximum cliques, as well as the properties of non-overlapping and overlapping cliques, seem to vary between different networks. Hence, it might be interesting to discuss the properties of large cliques a bit further.

Table 3 presents a listing of 12 maximum cliques of size 7 in the yeast PPI network. One can notice that the first clique and the last two cliques consist of proteins, which are not present in other cliques. On the other hand, all other cliques represent extensions of clique CEF1, SEC28, SET1, SFA1, SFB2. This indicates that some interesting substructures might be relatively isolated, while other substructures form larger clusters. These structures can be modelled e.g. by merging cliques [21]. Additionally, large cliques seem to comprise smaller cliques. This suggests that some PPI networks might have a hierarchical structure [47]. While functional modules are formed by groups of cliques, it seems that one can even identify smaller cliques as low-level building blocks of the network. Interestingly, while labels of proteins are naturally dependent on conventions of biologists, some of the identified maximum cliques seem to consist of proteins with lexicographically similar labels.

PF networks have slightly different characteristics. Network beta3s.reduced is atypical due to its reduced representation, which features a drastic cutoff of vertices with low degree. As a consequence, both beta3s and beta3s.reduced contain a maximum clique of size 38–39, while the average size of a clique needed to partition beta3s.reduced into non-overlapping cliques is 4.564–4.969. This is a value range previously observed in variations of random graphs and graphs with planted cliques [8]. The original network beta3s requires cliques of average size 1.897–2.049 to obtain a minimum vertex clique covering. However, it is worth mentioning that this value is still somewhat higher than the values obtained for PPI networks. This indicates a denser structure of PF networks with some large embedded cliques found in the “core” of the network. Such a phenomenon has not been observed in the studied PPI networks.

Fig. 2. Visualisation of beta3s protein folding network (on the left-hand side), which is the largest of studied networks, with over \(10^5\) vertices and its “core” beta3s.reduced (on the right-hand side). One can easily see that beta3s.reduced is the subgraph, which requires a high number of colours, while the visualisation of beta3s highlights mostly the three colours, which are used to colour most of the vertices in the “outer layer” of the network.

Summarising the above results, combinatorial optimisation properties seem to vary between different PPI networks. Comparison between a purely greedy and a stochastic approach confirms that stochastic optimisation techniques help in combinatorial optimisation for PPI and PF networks. Figure 1 indicates that to a certain extent, PPI networks seem to have similar structure. This figure shows colourings obtained for the PPI networks and groups vertices to layers, based on distance from a vertex with maximum degree. Visualisations reveal dense subgraphs in the proximity of the vertex with maximum degree, while this property seems to be more accentuated for large networks. Figure 2 presents a similar visualisation for the PF networks. This visualisation reveals slightly “layered” structure of beta3s.reduced. In this context, it is not surprising that large cliques are located within beta3s.reduced, while the “outer” layer of beta3s is sparser and seems to contain much smaller cliques.
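The grouping of vertices into layers by distance from a vertex with maximum degree can be sketched using breadth-first search. The function name is hypothetical, ties between vertices of equal maximum degree are broken by input order, and vertices in other connected components are simply left unassigned; how the original visualisations treat such vertices is not stated in the text.

```python
from collections import deque


def layers_from_hub(adj):
    """Group vertices by breadth-first distance from a maximum-degree vertex."""
    hub = max(adj, key=lambda v: len(adj[v]))
    distance = {hub: 0}
    queue = deque([hub])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in distance:
                distance[u] = distance[v] + 1
                queue.append(u)
    layers = {}
    for v, d in distance.items():
        layers.setdefault(d, []).append(v)
    return hub, layers


# Illustrative graph (hypothetical): a path 0-1-2-3 with a leaf 4 on vertex 1.
graph = {0: {1}, 1: {0, 2, 4}, 2: {1, 3}, 3: {2}, 4: {1}}
hub, layers = layers_from_hub(graph)
print("hub:", hub, "layers:", layers)
```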

While yeast network contains relatively large cliques of size 7, networks elegans and helico do not contain a clique larger than a simple triangle. Large cliques may both heavily overlap and represent relatively “isolated” substructures. Properties of cliques in network yeast seem to indicate hierarchical structure. For this purpose, data reductions might represent a promising research direction. A specific case is represented by the large beta3s PF network, which might further be studied in this context, too. Dominating sets were explored using an approximation algorithm. More interesting results might be obtained using a nature-inspired heuristic for this problem, e.g. algorithms based on ant colony optimisation [48].

5 Conclusions

We presented an experimental study of combinatorial optimisation problems in protein-protein interaction (PPI) and protein folding (PF) networks. Studied problems included maximum clique, chromatic number, maximum independent set, minimum vertex clique covering and minimum dominating set. We presented a unified technique to estimate these properties of large networks, which lead to NP-hard problems in general. Our experimental approach revealed several interesting properties of four PPI networks of the European COSIN project, as well as PF network beta3s and its reduced version. Even though the approach was applied to biological networks, its ideas are general and can also be used to analyse other complex networks, such as social networks or research citation networks.

Our investigation found maximum cliques for all PPI networks and provided a very small interval for the maximum clique size of PF network beta3s. For all four PPI networks, we found the optimal solution to the problem of their partitioning into non-overlapping cliques. We compared our method, which uses the stochastic elements of iterated greedy (IG) and randomised local search (RLS) algorithms, with its variant without stochastic optimisation. This comparison revealed that stochastic optimisation approaches provide results of better quality for maximum clique, chromatic number, maximum independent set and minimum vertex clique covering.

Overlapping cliques were investigated using enumerative methods, too. This investigation suggests that some of the studied PPI networks have a hierarchical structure, with large overlapping cliques possibly consisting of smaller cliques. We also identified the dominating sets of these networks. In the context of PPI networks, these are the sets of “central” proteins such that all other proteins interact with at least one protein of the dominating set.

We believe that this approach might be beneficial especially in exploration of new biological networks. Most of the studied problems are closely related to functional module detection. However, unlike network clustering, studied characteristics are clearly defined and can be used as a systematic basis for further investigations.