Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

The prize-collecting Steiner tree is a well known problem in combinatorial optimization and graph theory. Within the concept of the PCST, given an undirected network \(G = (V, E)\), where nodes are associated with prizes \( p_{j} \ge 0 \) and arcs are associated with costs \(c_{e} > 0\), the goal is to construct a sub-graph \(G' = (V', E')\) that has a tree structure. The researchers have studied different variants of the PCST problem in the literature. One of the broadly studied variant is known as Goemans–Williamson Minimization [1], where the objective is to identify a tree for a given graph by minimizing the total cost of arcs in a tree and minimizing the total prize of nodes excluded from the tree. This corresponds to the minimization of the following expression:

$$\begin{aligned} GW(G')= \sum _{e{\in }E'} c_{e} + \sum _{v{\not \in }V'} p_{v} \end{aligned}$$
(1)

The PCST has been successfully applied to model several real-world problems in utility networks. Recently, researchers have realized its application to biological networks for discovering the hidden knowledge [2]. Based on this idea, we have applied the PCST to gene interaction networks, where nodes correspond to genes and arcs represent the mutual information between genes. The PCST potentially captures the portion of graphs where genetic aberrations and mutations are highly present. Basically, biological interaction networks are large in size, and this can be remarkable challenge for existing PCST methods. By considering this fact in our previous studies, we have developed methods for the PCST that efficiently scale up to large biological network instances for analyzing the function of genes. In this work, we extensively test previously developed methods on generated gene interaction networks, and compare their performance on large networks.

2 Related Work

The pioneering work was performed by [3] in the PCST literature. The node weighted Steiner tree problem was proposed in [4], in which the specific set of nodes have to be covered by output tee. The state-of-the-art exact methods were presented in [5, 6], where the PCST was formulated by means of mixed integer linear programming (MILP) and a branch-and-cut algorithms was employed to solve underlining MILP. Some heuristic and matheuristic algorithms were studied in [1, 7, 8].

There some studies in the literature [2, 9,10,11] that already applied the PCST for functional analyses of protein interaction networks. As a result of these studies, the authors identified unknown functions of some proteins. They validated their computational findings by biological experiments. This shows the potential of the PCST to generate promising results while analyzing interaction networks.

3 Methodology

Usually, biological interaction networks are complex and huge in size. The PCST belongs to the class of NP-hard problems, where it is time consuming to obtain solutions for large graphs. This was the primary limiting factor for available PCST methods being applied on gene interaction networks. To enable the application of the PCST on biological networks, we have developed a heuristic and a matheuristic solution methods in our previous studies. The methods are shortly outlined in the following subsection.

3.1 The MST-Based Heuristic

This heuristic method is based on the iterative solution of Minimum Spanning Tree (MST) problems. Given an undirected network \(G = (V, E)\) and a user-defined parameter \(\alpha \), the heuristic constructs a complete network \(G_{1} = (V_{1}, E_{1})\) within the first iteration, where \(V_{1}:{v}\) only composed of nodes with \(p_{v} > \alpha \) and \(E_{1}:{(i,j)}\) corresponds to the shortest path distance between nodes i and j. The algorithm starts solving the problem by considering the nodes with prize \(p(v)\ge \alpha \) at the first iteration. Afterwards, the algorithm solves a MST on \(G_{1}\) and obtains a tree \(T_{1} = (V_{1}^{'}, E_{1}^{'})\) with the cost of C1. In the second iteration, the heuristic constructs next complete network \(G_{2} = (V_{2}, E_{2})\), where \(V_{2}:{v}\) formed by all nodes of tree from previous iteration \(v \in V_{1}^{'}\), and \(E_{2}:{(i,j)}\) corresponds to the shortest path distance between nodes i and j. Again, the algorithm computes a MST on \(G_{2}\) and obtains a tree \(T_{2} = (V_{2}^{'}, E_{2}^{'})\) with the cost of C2. If \(C2 \ge C1\), the algorithm terminates. Otherwise, the heuristic continues generating complete graph and solving MST problems until the cost of current tree gets bigger or equal to the cost of the previous tree. Then, the algorithm prunes the leaf nodes of the tree in order to further decrease the cost, and obtains final solution. The interested reader may refer to [12] for further details of the heuristic method.

3.2 The Clustering Matheuristic

The matheuristic algorithm was devised by combination of a heuristic clustering algorithm and an exact PCST solver. The main idea of the matheuristic was to divide the large graph into smaller graph clusters, and to solve each cluster separately using exact solver. The heuristic clustering algorithm clusters the nodes according to the all-pairs shortest path distance. Then, smaller graphs are constructed by inducing the nodes in the same cluster. Every smaller graph is solved by using exact PCST solver. Important to note that any exact solver could be used as inner solver at this stage. We have adapted the method proposed by [5] to our approach, and used it as an exact solver due to its efficiency. In [5], the PCST was formulated by MILP, and a branch-and-cut algorithm was proposed to solve the formulation. The interested reader may refer to [13] for further explanation of the matheuristic method.

4 Experimental Results

In this section, we test the MST-based heuristic and the clustering matheuristic method on large gene interaction network instances, and compare their performance. The benchmark instances are generated based on gene expression profiling data of Diffuse Large B-Cell Lymphoma (DLBCL) cancer patients available online in Gene Expression Omnibus repository.Footnote 1 There two subtypes of DLBCL cancer that are: the germinal center B cell (GCB) and an activated B cell (ABC). The goal is to identify a set of genes that are relevant for subtype classification. The networks are generated by using the multiplicative model of ARACNE [14] algorithm, which is a powerful tool for the reconstruction of gene interaction networks. ARACNE uses the mutual information among genes for the network reconstruction. We used two parameters (\(eps=\textit{0.01}\), \(eps=\textit{0.05}\)) fed into ARACNE in order to generate the test instances. In these networks, every arc represents the interaction between two genes and its weight is labeled as the pairwise correlation of expression values of genes. Each node is labeled with prize \(p_{v} = |E_{ABC}-E_{GCB}|\), where \(E_{ABC}\) and \(E_{GCB}\) are the mean value of gene expression of ABC and GCB cancer patients for corresponding gene, respectively. All of the nodes have positive prize \(p_{v} >0\) in generated instances.

Table 1 Results of DLBCL test instances generated with the parameter \(eps=\textit{0.01}\)

The computational experiments have been performed on a machine equipped with an Intel(R) Xeon(R) CPU E5320 1.86 GHz processors and 32 GB of shared memory. A single core was used for the experiments.

Table 1 summarizes the results of both methods for gene interaction network instances generated with the parameter \(eps=\textit{0.01}\). The first three columns of the table show the names and the sizes of test instances, respectively. From the fourth to the ninth columns we report the objective values and running times of the MST-based heuristic method [12], in which the algorithm employs different values for the parameter \(\alpha \). The tenth and eleventh columns present the objective values and execution times of the clustering matheuristic method [13].

Table 2 Results of DLBCL test instances generated with the parameter \(eps=\textit{0.05}\)

Table 2 delivers the results of the MST heuristic and clustering matheuristic methods for interaction network instances generated with the parameter \(eps=\textit{0.05}\).

According to the results of the tables, both methods, the MST heuristic and clustering matheuristic, were able to provide solutions in a reasonable time. The solutions obtained by the clustering matheuristic are considerably better than the MST heuristic in terms of solution cost for these instances. The MST heuristic also was able to obtain good quality solutions, and the running times of the instances are improved by decreasing the parameter \(\alpha \). The primary reason for elaborated execution times is the decay in parameter \(\alpha \), where the larger set of nodes are considered in computations during the first iteration. The general pattern of the solution cost is decreased by lowering the \(\alpha \) from 0.5 to 0.3 and 0.1, however, lowering the \(\alpha \) to 0.0 did not improve the cost further. The main reason for this is the MST heuristic was designed for large networks that have smaller number of positive nodes \(p_{v} >0\). In contrast, the clustering matheuristic method was developed for large networks where most of the nodes have positive prizes \(p_{v} >0\). The parameter \(\alpha \) can be used to tune a trade off between the quality and running time for the MST heuristic. The \(\alpha \) can be set to a reasonably higher value in order to analyze large interaction networks fast, and also not losing too much from the optimality.

5 Conclusions

In this study, we have compared a MST-based heuristic and a clustering matheuristic methods developed for large prize-collecting Steiner tree problems generated from real biological data describing gene interaction networks. Experimental results support that the performance of the clustering matheuristic is better than the MST heuristic method in terms of solution quality for the interaction network instances, however, MST heuristic also can be used to analyze large interaction networks in a quick manner by tuning the \(\alpha \) parameter.