
1 Introduction

Linkage studies for determining relevant genes for specific human diseases can point to a genomic region containing hundreds of genes, while the high-throughput sequencing approach will often identify a great number of non-synonymous genetic variants. Although the detection of potentially deleterious variants can be easily automated, it often results in the identification of thousands of candidate disease genes. Since the experimental verification of an individual gene can be both difficult and time consuming, an efficient way to reduce the validation cost is to narrow down the large list of candidate genes to a small and manageable set of promising genes; a process called gene prioritization (GP).

Manual examination of biological databases to select the most promising causative genes for a disease of interest has been only partially successful, since the selection rests solely on the subjective impressions of the researcher and genetic disorders often involve several primarily responsible genes. For this reason, various computational GP methods have been proposed.

Earlier works [1] investigated gene-disease associations based on gene expression profiles or genome-wide association studies (GWAS). Genome-wide association studies identify genes involved in human disease by searching the genome for small variations, called single nucleotide polymorphisms (SNPs), that occur more frequently in people with a particular disease than in healthy people. Each study can look at hundreds or thousands of SNPs at the same time. However, this approach tends to produce many false positive results, and the experimental validation of these candidate genes, for instance through resequencing, pathway or expression analysis, is still expensive and time consuming [2].

For these reasons other GP approaches have been investigated, such as guilt-by-association (GBA), in which candidate disease genes are ranked by exploiting the assumption that similar genes tend to share similar diseases [3]. The input of these methods is a gene network, in which nodes represent genes and connections encode precomputed functional relationships among genes, such as common functional annotations (e.g. Gene Ontology annotations [4]), transcriptional co-expression or regulation, and direct molecular interactions [5]. In this context, many approaches have been adopted to compute the GP ranking, ranging from protein-protein interaction network analysis and semi-supervised graph partitioning [5], to flow propagation [6], and random walks [7].

To improve the accuracy of GP methods, recent studies have investigated the advantage of integrating multiple data sources, including expression profiles, SNP genotype data, expression quantitative trait loci, functional profiles, and network-based sources, such as gene-chemical networks, protein complexes and genetic/physical interactions [8]. A general approach in data source integration ranks each candidate gene according to each individual data source using various metrics, and then combines the ranks from all data sources by means of order statistics to obtain an overall rank [3]. For network-based integration approaches, a consensus network is constructed by combining the structure and the characteristics of each network, through different network integration strategies [9]. The consensus network tends to provide a better signal-to-noise ratio and complementary information about genes, thus leading to an improvement in prediction accuracy in most cases [9, 10].

Apart from the drawbacks and benefits discussed above for each approach, the main limitation shared by the above-mentioned GP methods is that they completely neglect the class imbalance characterizing GP: there are far fewer causative genes (the positive instances) than non-causative ones (the negative instances). For instance, around \(40\,\%\) (10/09/15 update) of the known genetic diseases in the OMIM (Online Mendelian Inheritance in Man) database still have few or no established gene-disease associations [11]. Computational methodologies usually suffer a drastic performance deterioration on imbalanced classes, since algorithms tend to focus on classifying majority-class samples while ignoring or misclassifying minority-class samples [12]. Unfortunately, in our context the minority class carries almost all the information we have about the disease under study, which makes it necessary to adopt specifically designed imbalance-aware machine learning algorithms, often referred to as cost-sensitive. Cost-sensitive techniques have obtained successful results in similar contexts, e.g. in protein function prediction [13, 14].

Here we propose a novel network-based approach for detecting disease-gene associations which copes with the label imbalance by ‘transforming’ the input network so as to effectively represent the imbalance, and by applying cost-sensitive methodologies to the obtained network representation. In particular, our procedure can be summarized as follows: (1) following the approach proposed in [15], the input network is projected onto a bidimensional space, where each labeled input node corresponds to a labeled point whose coordinates depend on its positive and negative neighborhoods in the input network, respectively; (2) the resulting pair of coordinates/features for each point is given in input to a family of cost-sensitive regressors to learn a cost-sensitive model that ranks the unlabeled nodes. The node projection at Step 1 embeds the imbalance between positive and negative genes at each neighborhood in the corresponding point position. Moreover, working with just two features makes Step 2 of our procedure very fast, thus allowing our method to efficiently handle large data sets. Finally, the method is general enough to include strategies for integrating heterogeneous network sources in a dedicated preprocessing step, so as to exploit the benefit of working with more reliable and informative networks. We experimentally validated our method on a public benchmark data set for GP, including almost nine thousand human genes and around seven hundred diseases collected from the Medical Subject Headings database.

The paper is organized as follows: in Sect. 2 we formalize the problem, while Sect. 3 describes both the gene networks and the network integration techniques adopted in the benchmark experimental setting. In Sect. 4 we introduce our two-step procedure; in Sect. 5 we assess its effectiveness by comparing its performance with state-of-the-art methodologies. Finally, Sect. 6 concludes the paper.

2 Problem Setting

The disease-gene prioritization problem can be seen as a semi-supervised bipartite ranking problem on undirected graphs [16]. Specifically, a gene network can be represented through an undirected weighted graph \(G=(V,\varvec{W})\), where \(V=\{1,2,\ldots ,n\}\) is the set of vertices corresponding to genes, and \(\varvec{W}\) is the \(n\times n\) weight matrix, where each element \(W_{ij}\in [0,1]\) represents some notion of functional similarity between vertices i and j. Vertices in V can be partitioned into two subsets: \(S\subset V\), containing the instances labeled according to a specific MeSH subject heading, and its complement \(U = V \setminus S\), including the unlabeled instances and therefore representing the object of our inference. Within the former, the sets of positive and negative instances are denoted with \(S_+\) and \(S_-\), respectively.

The task consists in learning a ranking function \({\phi :U\rightarrow \mathbb R}\) that assigns higher values to positive instances than to negative ones, thereby ranking the former above the latter. From this standpoint, GP is cast as a semi-supervised learning problem on graphs, since the gene ranking can be inferred by exploiting both labeled and unlabeled nodes (genes) and the connections among them.

To make the problem even harder, the family of graphs under investigation is subject to a strong imbalance between negative and positive instances, with a marked disproportion in favor of negatively labeled nodes.

Table 1. Gene networks used in experimental campaign.

3 Materials

The input connection matrix \(\varvec{W}\) represents a complex set of interactions or similarities between genes and/or their products (such as proteins), obtained as the integration of several heterogeneous data sources. We adopt the benchmark experimental setting proposed in [9], which is composed of nine human gene networks covering 8449 genes, and describing functional interactions, transcriptional co-expression/regulation and localization, gene expression profiles, gene-chemical relationships, protein-protein physical and genetic interactions, and GO semantic similarity (see Table 1). The database also provides the associations of such genes with 708 selected MeSH (Medical Subject Headings) diseases, downloaded from the CTD database [17]. The selected MeSH disease terms include between 5 and 200 causative genes.

3.1 Network Integration

Since the various networks cover different numbers of genes, before combining them we extend each network to the union of the genes in the single networks, filling the missing rows/columns with zeros. As done in [9], in a pre-processing step we delete the weakest edges, so as to remove very small (and putatively noisy) similarities while ensuring at least one neighbor for each node.

As integration scheme we adopt the unweighted integration of the single networks, which performed best among the unweighted schemes proposed in [9]. It is the simple average of the m available network adjacency matrices, i.e. \(\varvec{W}^* = \sum _{d=1}^m{\varvec{W}^{(d)}}/m\). Finally, we apply to \(\varvec{W}^*\) the Laplacian normalization \(\varvec{D}^{-\frac{1}{2}} \varvec{W}^{*} \varvec{D}^{-\frac{1}{2}}\), where \(\varvec{D}\) is the diagonal matrix with \(D_{ii} =\sum _j \varvec{W}^*_{ij}\).
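The integration pipeline above (zero-padding to the union of genes, simple averaging, Laplacian normalization) can be sketched as follows. This is a minimal NumPy sketch under the stated definitions; the function name and the padding details are ours, not taken from [9]:

```python
import numpy as np

def integrate_networks(nets, gene_lists):
    """Unweighted integration: pad each network with zeros to the union
    of genes, average the m adjacency matrices, then apply the Laplacian
    normalization D^{-1/2} W* D^{-1/2}.
    Illustrative sketch; names are ours, not from the paper's code."""
    union = sorted(set().union(*gene_lists))
    idx = {g: i for i, g in enumerate(union)}
    n = len(union)
    W_star = np.zeros((n, n))
    for W, genes in zip(nets, gene_lists):
        rows = [idx[g] for g in genes]
        padded = np.zeros((n, n))
        padded[np.ix_(rows, rows)] = W      # missing rows/columns stay zero
        W_star += padded
    W_star /= len(nets)                     # simple average over the m networks
    d = W_star.sum(axis=1)
    d[d == 0] = 1.0                         # guard against isolated nodes
    inv_sqrt = 1.0 / np.sqrt(d)
    return W_star * np.outer(inv_sqrt, inv_sqrt), union
```

The outer product of the inverse square-root degrees is equivalent to the matrix product \(\varvec{D}^{-\frac{1}{2}} \varvec{W}^{*} \varvec{D}^{-\frac{1}{2}}\) and avoids building the full diagonal matrix.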

We performed our experiments on two networks: the first, hereinafter called Net6, was obtained by integrating the six gene networks, excluding the semantic similarity-based ones; the second (Net9) is the unweighted integration of all nine single networks reported in Table 1.

4 Methods

We decided to solve the bipartite ranking problem introduced in Sect. 2 through a generalized linear model (GLM), where the response variable, suitably thresholded through the sign function, determines membership in either the positive or the negative class, while the predictors have been chosen so as to exploit the node similarity encoded in the weight matrix \(\varvec{W}\). In order to keep the computational burden low and to exploit the network topology, we extract two features from the input network, as follows: each node \(i\in S\) is associated with a point \(\varDelta _i=(\varDelta _i^+,\varDelta _i^-)\) in the plane, where

$$\begin{aligned} \varDelta _i^+ = \sum _{j\in S_+} w_{ij},\quad \quad \varDelta _i^- = \sum _{j\in S_-} w_{ij} \end{aligned}$$
(1)

Intuitively, the more functionally similar node i is to positive nodes, the higher the value of its \(\varDelta _i^+\) coordinate; analogously for the contribution of negative nodes to the second coordinate. Recalling the one-to-one correspondence between genes and vertices, with this projection we hope to find a bipartition of the nodes in S that concentrates positive nodes mostly toward the lower-right region of the first quadrant, and negative ones in the remaining portion of it. This network projection onto the plane, already adopted in [15], also allows us both to avoid the curse of dimensionality, since the projected space has just two dimensions, and to deal with the class imbalance, since the projected positive and negative two-dimensional points can be assigned different misclassification costs when learning the GLM.
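Eq. (1) amounts to two row sums over the labeled columns of \(\varvec{W}\). A minimal NumPy sketch (the function and variable names are ours):

```python
import numpy as np

def project_nodes(W, pos, neg):
    """Map each node i to the point (Delta_i^+, Delta_i^-) of Eq. (1):
    the summed edge weights toward positively and negatively labeled nodes.
    Minimal sketch; the function name is ours."""
    pos = np.asarray(pos)
    neg = np.asarray(neg)
    delta_plus = W[:, pos].sum(axis=1)   # similarity mass toward S_+
    delta_minus = W[:, neg].sum(axis=1)  # similarity mass toward S_-
    return np.column_stack([delta_plus, delta_minus])
```

Both coordinates are computed for every node, so the same call serves labeled nodes (to fit the model) and unlabeled ones (to be ranked).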

Table 2. GLMs adopted in the experimental campaign.

We adopted the four GLM models described in Table 2. Among the various cost-sensitive schemes proposed to allow regression models to handle imbalanced classes [25], one of the most effective is maximum weighted likelihood estimation [26], which consists in maximizing the sum of the log-densities of the sample items, each weighted by a coefficient \(\omega \in \mathbb R_0^+\): the higher the coefficient, the more influential the corresponding sample point in the overall optimization. Here we propose two variants of the above vanilla regression models, introducing two weighting schemes \(\varvec{\omega }^a\) and \(\varvec{\omega }^b\), as follows, where \(n_+\) and \(n_-\) denote the number of positive and negative instances, respectively:

$$\begin{aligned} \omega ^a_i={\left\{ \begin{array}{ll} 1/{n_+} &{} \text {if gene}\ i\in S_+ \\ 1/{n_-} &{} \text {otherwise} \end{array}\right. }\quad \quad \omega ^b_i={\left\{ \begin{array}{ll} \varDelta _i^+/\sum _{j\in S_+} \varDelta _j^+ &{} \text {if gene}\ i\in S_+ \\ 1/{n_-} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

Intuitively, both schemes compensate for the class imbalance by giving higher weights to infrequent instances. Scheme ‘b’ breaks the flatness of the positive weights by assigning higher influence to positive nodes that are functionally more similar to nodes of the same class. In other words, the stronger the positive neighborhood of a positive node, the higher its influence in the overall maximization process.
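The two weighting schemes of Eq. (2) can be computed directly from the projected \(\varDelta ^+\) coordinates and the class labels. An illustrative sketch (argument names are ours):

```python
import numpy as np

def sample_weights(delta_plus, is_pos, scheme='b'):
    """Weighting schemes of Eq. (2). Scheme 'a': flat 1/n+ for positives
    and 1/n- for negatives; scheme 'b': the positive weight is instead
    redistributed proportionally to Delta_i^+. Argument names are ours."""
    is_pos = np.asarray(is_pos, dtype=bool)
    n_pos, n_neg = is_pos.sum(), (~is_pos).sum()
    w = np.full(len(is_pos), 1.0 / n_neg)    # negatives: 1/n- in both schemes
    if scheme == 'a':
        w[is_pos] = 1.0 / n_pos
    else:                                    # scheme 'b'
        dp = np.asarray(delta_plus, dtype=float)[is_pos]
        w[is_pos] = dp / dp.sum()
    return w
```

Note that under both schemes the positive weights sum to 1, so the two classes contribute equal total mass to the weighted likelihood; scheme ‘b’ only changes how that mass is spread across the positives.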

Summing up all possible combinations of GLMs and weighting schemes, we obtained a total of 12 models, which we denote with the schema “[W]GP-mod [ws]”, where WGP stands for Weighted Gene Prioritization, mod is one of the four GLM acronyms used in Table 2, the weighting scheme ws \(\in \) {‘a’,‘b’}, and square brackets denote optional arguments.
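To make the weighted-likelihood fitting concrete, the following sketch trains a weighted logistic GLM on the two projected features by plain gradient ascent and then scores unlabeled nodes. This is a didactic stand-in for one of the twelve models (a WGP-LogR analogue), not the fitting procedure actually used in the experiments; all names are ours:

```python
import numpy as np

def fit_weighted_logit(X, y, w, lr=0.1, iters=5000):
    """Maximum weighted likelihood for a logistic GLM on the two
    projected features, fitted by plain gradient ascent.
    Didactic sketch, not the paper's implementation."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        z = np.clip(Xb @ beta, -30.0, 30.0)      # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        beta += lr * (Xb.T @ (w * (y - p)))      # weighted log-likelihood gradient
    return beta

def rank_scores(X, beta):
    """Linear scores used to rank candidate genes (higher = more promising)."""
    return np.hstack([np.ones((len(X), 1)), X]) @ beta
```

Replacing the logistic link with cloglog, identity, or log would give sketches of the other three GLM families of Table 2 under the same weighting machinery.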

5 Results and Discussion

For a fair comparison, the experimental validation of the proposed models follows the setting adopted in [9]. We compared our method with the state-of-the-art techniques briefly described in Table 3, and estimated the generalization performance by averaging the performances observed through classical k-fold cross-validation (CV), with \(k = 5\). Performance has been assessed using both the Area Under the ROC Curve (AUC) and the Precision at different Recall levels (PXR). As in [9], we first ran our methods on the network Net6 (see Sect. 3.1). Table 4 shows the corresponding average AUC, sorted in decreasing order. Apart from GP-PR and GP-CLogLR, all our methods outperform the top-performing benchmark algorithm (\(\mathrm {S}_{\mathrm {AV}}\ t = 5\)). This attests both to the high informativeness of the two projected features defined in Eq. (1) and to the effectiveness of the GLMs in coping with the label imbalance at each node neighborhood. Moreover, to appreciate the benefit of the cost-sensitive models w.r.t. the corresponding vanilla versions, we performed the one-sided Wilcoxon Signed Rank test between all pairs of methods within the same family, to assess whether their population mean ranks differ. As a result, we observed a meaningful increase in performance of the ‘b’ scheme over the ‘a’ variant – confirming our initial assumption that positives, carrying more information than negatives, should be emphasized when learning the predictive model – and of both cost-sensitive models w.r.t. their vanilla version (\(p\)-value \(<0.001\)). This regularity breaks down in both the linear and Poisson regressions, where scheme ‘a’ outperforms variant ‘b’ (\(p\)-value \(= 0.025\)). We conjecture that these results are due to the convergence of the GLM fitting procedures toward spurious optima in rare instances, which in turn may be caused by the peaked landscape of the weight distribution in the ‘b’ scheme.
Finally, due to the fast convergence of the regression performed in the 2-dimensional projected space, our method is also scalable, taking around 5 seconds to perform the entire 5-fold CV procedure for a single MeSH disease on an Intel i7-860 CPU 2.80 GHz machine with 16 GB of RAM.
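The per-disease AUC used throughout this evaluation can be computed directly as the normalized Mann-Whitney U statistic, i.e. the fraction of (positive, negative) pairs ranked correctly. A self-contained sketch (function name ours):

```python
import numpy as np

def auc(scores, labels):
    """AUC as the normalized Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs ranked correctly, with ties counting half.
    Sketch of the per-disease evaluation; assumes binary labels."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]           # all positive-negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))
```

This pairwise formulation makes explicit why AUC suits the bipartite ranking setting of Sect. 2: it directly measures how often positives are ranked above negatives, regardless of the class imbalance.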

Table 3. Competitor benchmark methods.
Table 4. Average AUC across MeSH terms on the network Net6.
Fig. 1.

Paired AUC comparison between cost-sensitive ‘b’ schema and the corresponding vanilla version for: (a) cloglog and (b) logit link functions.

To better investigate the improvements achieved by the cost-sensitive methods, Fig. 1 shows the paired AUC obtained by the vanilla regression and the corresponding cost-sensitive ‘b’ version for the cloglog and logit link functions (similar trends were obtained for all other paired comparisons – results not shown). A large majority of the points lie above the bisector, showing that the cost-sensitive variants achieve higher AUC values for most of the considered MeSH diseases. Indeed, in the first two columns of Table 5 we report the proportion of MeSH terms on which the cost-sensitive methods outperform the corresponding vanilla ones; this proportion ranges from \(70.1\,\%\) to \(85.9\,\%\).

Table 5. Proportion of wins (in terms of AUC) of cost-sensitive vs. cost-insensitive methods, observed over all the considered MeSH terms.

Similar AUC results are obtained when running the proposed methods on the network Net9, as reported in Table 6. Results for the GBA, RW and RWR methods are not reported in the referenced papers due to their low performance. All our methods (except GP-CLogLR) perform better than the top-performing benchmark method (\(\mathrm {S}_{\mathrm {AV}}\ t = 5\)). Note how the best method (GP-CLogLR b) makes the gain due to the cost-sensitive approach even more noticeable, since its vanilla counterpart ranks only thirteenth. The Wilcoxon Signed Rank test confirms the results observed for the network Net6, with some exceptions. First, we observe no meaningful differences either between the two cost-sensitive variants of the Poisson model, or between the ‘a’ scheme and its naive version (\(p\)-value \(>0.05\)). Moreover, the only model privileging the vanilla variant over its ‘a’ scheme counterpart is the logistic one (\(p\)-value \(<0.001\)). The exceptional nature of this event is confirmed by the entries reported in the last two columns of Table 5: although the proportion of wins of cost-sensitive methods over their cost-insensitive variants is less pronounced than that observed for the network Net6, six out of eight entries still show a remarkable disproportion in favor of the cost-sensitive schemes.

Table 6. Average AUC across MeSH terms on the network Net9.
Fig. 2.

Performance distribution of the proposed methods across MeSH terms for: (a) network Net6, and (b) network Net9. Box colors depict the performance ranking of each method, as explained by the legend reported below the graphs. In such setting, boxes sharing the same colors represent indistinguishable methods.

To better analyse the AUC distributions over the MeSH diseases, Fig. 2 reports the box-and-whiskers plots of all proposed methods. Boxes are colored so as to reflect the ranking of the methods, obtained by performing all pairwise comparisons under the one-sided Wilcoxon Signed Rank test. Models sharing the same color represent maximal sets of methods indistinguishable under the above test at the 0.05 significance level. The darker the color, the worse the ranking, as shown in the legend under the picture; in particular, for the sake of visualization, all methods ranking fourth or lower are joined in the same class. Apart from the already discussed superiority of the cost-sensitive methods, we observe both a smaller variance and a reduced presence of outliers (not shown in the pictures). It is worth noting the marked skewness toward lower AUC values in all experiments, confirmed by the fact that the means of the AUC distributions (black markers in the pictures) are always lower than their medians (depicted with notches). Evaluating performance through means, as done in Tables 4 and 6, strongly penalizes all methods, since mean values are heavily affected by outliers with low AUC values. To guarantee a fair comparison with the benchmark results we still use this estimator, noting however that median values give a more informative and less biased view of the overall performance.

Fig. 3.

PXR levels achieved on the network: (a) Net6, and (b) Net9.

We conclude this analysis by showing in Fig. 3 the PXR results for recall levels ranging from 0.1 to 1, in steps of 0.1. The performances of the proposed methods are very close to each other; for the sake of readability, we therefore report only the results for the vanilla and cost-sensitive ‘b’ scheme methods, since the ‘a’ scheme achieves almost indistinguishable results. To better appreciate the advantage of the cost-sensitive methods, we use a light gray level for all vanilla methods and a dark gray one for their cost-sensitive variants. Apart from the slight but consistent advantage of the latter, we observe a noticeable gain w.r.t. \(\mathrm {S}_{\mathrm {AV}}\ t = 2\), the only benchmark method whose authors published PXR performances, with the exception of the precision at a recall level of 0.1 in picture (b), where the light and dark gray lines almost overlap, apart from the GP-LR method, which performs slightly worse. Note that in Fig. 3(a) the vanilla methods tend to be more accurate at lower recall levels; nevertheless, for all the remaining recall values, in particular in the range [0.3, 1], the cost-sensitive methods always outperform the cost-insensitive ones.

6 Conclusions

In this work we proposed a novel approach for gene-disease prioritization specifically designed to deal with imbalanced data, such as those characterizing databases of seed genes for known human diseases. We have shown that imbalance-aware methods can noticeably improve the performance in detecting gene-disease associations, evaluating the effectiveness of the proposed approach on a large-sized benchmark for the gene prioritization problem. Future work will be devoted to exploiting the hierarchical contribution coming from ontologically related gene classes.