4.1 Introduction

Most common diseases, common traits, and pharmacological drug responses are genetically intricate, polygenic, and multifactorial, and often result from an interaction of genetic, environmental, and physiological factors. Although high-throughput, genome-wide studies such as linkage analysis and gene expression profiling are useful for classification and characterization, they often fail to provide sufficient information to identify specific disease causal genes or drug targets. Both approaches typically identify hundreds of potential candidate genes and cannot effectively reduce the number of target genes to a manageable figure for further validation.

4.2 Bioinformatic Tools for Gene Prioritization

Several computational approaches (Table 4.1) have been developed for gene prioritization to overcome these limitations of high-throughput, genome-wide studies [13, 8, 10, 16, 59, 61, 62, 65, 76]. See recent reviews [7, 29, 43, 46, 50, 60, 64, 76] for technical and algorithmic details of various gene prioritization tools. While a majority of these tools are based on the assumption that similar phenotypes are caused by genes with similar or related functions [9, 20, 27, 55, 65], they differ in the strategy adopted for calculating similarity and in the data sources utilized [63]. Further, no single source of data can be expected to capture all relevant relations. For example, using coexpression data alone will fail to detect many effects of posttranscriptional modifications, while relying on protein–protein interaction data alone will fail to capture transcriptional regulation. Since these different data types are complementary, they need to be merged not only to improve coverage but also to infer stronger relationships through the accumulation of evidence [43]. Nevertheless, with the exception of Endeavour [3, 63] and ToppGene [9, 10], most existing approaches combine only a few data sources.

Table 4.1 List of current bioinformatics approaches and tools to rank human disease candidate genes

4.2.1 Functional Annotation-Based Approaches

The functional annotation-based candidate disease gene prioritization approaches are usually based on the guilt-by-association principle, which asserts that reliable predictions about the disease involvement (“guilt”) of a gene can generally be made if several of its partners (e.g., genes with correlated expression profiles, protein interaction partners, or genes involved in the same biological process or pathway) share a corresponding “guilty” status (“association”) [43]. Incorporating prior information or knowledge about a disease is thus critical for this type of approach. One of the fundamental challenges for these approaches is gathering, normalizing, and integrating heterogeneous data from multiple sources and keeping them current. Several online tools now make it possible to carry out such analyses intuitively, without programming knowledge or the direct support of a bioinformatics expert (see [29, 46, 64] for a list of such Web-based tools). While the use of multiple heterogeneous data sources makes the functional annotation-based approaches a more thorough and less biased global assessment of candidate genes, they still have several limitations. First, they are biased towards the training set: by using a training set, it is assumed that the disease genes yet to be discovered will be consistent with what is already known about a disease and/or its genetic basis. This assumption may not always hold. Additionally, since these approaches rely on known gene annotations, they tend to favor better annotated genes; a “true” candidate gene can be missed if it lacks sufficient annotations. Thus, the effectiveness of this approach depends critically on how well the disease under investigation is defined both molecularly and physiologically.

Second, the annotations and analyses provided, and the prioritization by these approaches, can only be as accurate as the underlying original sources from which the annotations are retrieved. For instance, only one fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 30 % of genes are still not well defined. Third, using an appropriate or “true representative” training set is critical. For instance, in an earlier study, we observed that using larger training sets (>100 genes) decreases the sensitivity and specificity of the prioritization compared to smaller training sets (7–21 genes) [10]. Lastly, almost all of the current disease gene identification and prioritization approaches are coding-gene-centric, while it has been speculated that complex traits result more often from noncoding regulatory variants than from coding sequence variants [32, 35, 40].

4.2.2 Network-Based Approaches

A majority of the current computational disease candidate gene prioritization methods [13, 10, 16, 59, 61, 62, 65, 76] rely on functional annotations, gene expression data, or sequence-based features. The coverage of the gene functional annotations, however, is still a limiting factor. Currently, only a fraction of the genome is annotated with pathways and phenotypes [10]. While two thirds of all the genes are annotated by at least one functional annotation, the remaining one third has yet to be annotated. Interestingly, because biological networks have been found to be comparable to communication and social networks [28] through commonalities such as scale-freeness and small-world properties, the algorithms used for social and Web networks should be equally applicable to biological networks.

Recent biotechnological advances (e.g., high-throughput yeast two-hybrid screening) have facilitated the generation of proteome-wide protein–protein interaction networks (PPINs) or “protein interactome” maps in model organisms and humans [53, 56]. Additionally, the shift in focus to systems biology in the post-genomic era has generated further interest in these networks and pathways. As a result, PPINs have been increasingly used not only to identify novel disease candidate genes [17, 30, 34, 73, 74] but also for candidate gene prioritization [8, 11, 34, 45, 73]. At the same time, network topology-based analyses hitherto used in social and Web network analyses have been successfully applied to the identification and prioritization of disease candidate genes [8, 12, 19, 24, 34, 36, 54, 57, 70, 73]. Broadly, network topology-based candidate gene ranking approaches can be grouped into two categories: parameter-based and parameter-free methods. The parameter-based methods, such as PageRank with Priors (PRP [8]), Random Walk (RW [34]), and PRIoritizatioN and Complex Elucidation (PRINCE [70]), require, as the name indicates, auxiliary parameters that need to be trained using available data sets. PRP, for example, needs a parameter β to control the probability of jumping back to the initial node [8]. Similarly, the PRINCE algorithm uses a parameter to describe the relative importance of prior information [70]. However, selecting optimal parameters is often a challenge, and therefore the more “user-friendly” parameter-free approaches are preferred [24]. Further, most of the parameter-based approaches take into account the global information in the entire network, and thus they typically require extensive computation. For instance, in PRP, the scores of all the vertices in the network need to be updated iteratively until they converge. This process tends to be slow and inefficient, especially when the network is large.

The parameter-free methods (e.g., interconnectedness or ICN [24]), on the other hand, measure the closeness of each candidate gene to known disease genes by taking into account the direct link and the shared neighbors between two genes and are therefore computationally less intensive. However, the performance of parameter-free methods has not been comparable to that of parameter-based approaches. To address this, we recently developed a novel network-based parameter-free framework for discovering and prioritizing human rare disease candidate genes [75]. Our goals were to (a) improve prioritization performance compared to current parameter-free methods and (b) achieve performance comparable to that of the parameter-based ones. Using several test cases, we compared the performance of our method (the Vertex Similarity (VS)-based approach) to two approaches, one parameter-based (PRP) and one parameter-free (ICN), and also used it to rank the immediate neighbors of known rare disease genes as potential novel candidate genes.
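To make the parameter-free idea concrete, the following is a minimal sketch of a closeness score built only from the two ingredients named above: a direct link and shared neighbors. It is not the published ICN or VS formula; the degree normalization, the function names, and the toy gene identifiers are assumptions chosen for illustration.

```python
import math


def closeness(network, a, b):
    """Illustrative closeness of genes a and b (NOT the published ICN/VS formula).

    network maps each gene to the set of its interaction partners.
    The score counts a direct link plus shared neighbors and normalizes
    by the geometric mean of the two degrees (an assumed normalization).
    """
    neighbors_a, neighbors_b = network[a], network[b]
    if not neighbors_a or not neighbors_b:
        return 0.0
    direct = 1 if b in neighbors_a else 0
    shared = len(neighbors_a & neighbors_b)
    return (direct + shared) / math.sqrt(len(neighbors_a) * len(neighbors_b))


def rank_candidates(network, seeds, candidates):
    """Rank candidates by their mean closeness to the known disease genes (seeds)."""
    scores = {
        c: sum(closeness(network, c, s) for s in seeds if s != c) / len(seeds)
        for c in candidates
    }
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical toy network: S1 is a known disease gene, C1..C3 are candidates.
toy_network = {
    "S1": {"C1", "C2", "X"},
    "C1": {"S1", "X"},
    "C2": {"S1"},
    "X": {"S1", "C1"},
    "C3": set(),
}
ranking = rank_candidates(toy_network, ["S1"], ["C1", "C2", "C3"])
```

No parameter has to be tuned and each score depends only on the local neighborhood of the two genes, which is why methods of this kind are cheaper to compute than iterative, network-wide methods such as PRP.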

Network-based approaches using protein–protein interaction data, while useful, have some practical limitations [29]. First, high-throughput protein–protein interaction sets, especially yeast two-hybrid sets, are inherently noisy and may contain several interactions with no biological relevance [18, 26, 37, 66]. Surprisingly, only 5.8 % of the human, fly, and worm yeast two-hybrid interactions have been confirmed by the HPRD (Human Protein Reference Database), a manually curated compilation of protein interactions [47]. Second, the protein interactome tends to be biased towards well-studied proteins. Third, some of the human protein interactome data is derived by extrapolating high-throughput interactions from other species. Even though previous studies have shown that PPINs are conserved across species [25], species-specific protein interactions remain possible. Fourth, two interacting proteins need not lead to similar disease phenotypes when mutated: for instance, they may have redundant or different but overlapping functions, or one may be more dispensable than the other [47]. Additionally, disease proteins may lie at different points in a molecular pathway and not necessarily interact directly. Fifth, disease mutations need not always involve proteins (e.g., telomerase RNA component in congenital autosomal dominant dyskeratosis) [47]. Lastly, most of the network topology-based algorithms were originally developed to identify “important” nodes in networks. Although extended versions of these algorithms are used to prioritize nodes relative to selected “seeds,” they could still be biased towards hubs.

4.3 ToppGene Suite: A One-Stop Portal for Candidate Gene Prioritization Based on Functional Annotations and Protein Interactions Network

In this section, we describe the ToppGene Suite (http://toppgene.cchmc.org) [8–10], a one-stop online assembly of computational software tools that enables biomedical researchers to (a) prioritize candidate genes based on functional annotation similarity between training and test set genes (ToppGene) [10], (b) prioritize candidate genes based on protein interactions network analysis (ToppNet) [8], and (c) identify and rank candidate genes in the training set interactome based on both functional annotations and PPIN analysis (ToppGeNet) [8]. The ToppGene knowledgebase combines 17 gene features available from the public domain. It includes both disease-dependent and disease-independent information in the form of known disease genes, previous linkage regions, association studies, human and mouse phenotypes, known drug genes, microarray expression results, gene regulatory regions (transcription factor target genes and microRNA targets), protein domains, protein interactions, pathways, biological processes, and literature co-citations.

4.3.1 ToppGene: Functional Annotations-Based Candidate Gene Prioritization

In the first step, ToppGene generates a representative profile of the training genes using as many as 17 features and identifies over-represented terms from the training genes. Each of the test set genes is then compared to this representative profile of the training set, and a similarity score is derived for each of the 17 features. Different similarity measures are used for categorical (e.g., GO annotations) and numeric (i.e., gene expression) annotations. For categorical terms, a fuzzy-based similarity measure (see Popescu et al. [51] for additional details) is applied, while for numeric annotations, i.e., the microarray expression values, the similarity score is calculated as the Pearson correlation of the expression vectors of the two genes. The 17 similarity scores are combined into an overall score using statistical meta-analysis, and a p-value for each annotation of a test gene G is derived by random sampling of the whole genome. The p-value of the similarity score \( S_i \) is defined as:

$$ p\left(S_i\right)=\frac{\text{count of genes having score higher than } G \text{ in the random sample}}{\text{count of genes in the random sample containing annotation}}. $$

To combine the p-values from multiple annotations into an overall p-value, Fisher’s inverse chi-square method is used, which states that \( -2\sum_{i=1}^{n} \log p_i \to \chi^2(2n) \) (assuming the \( p_i \) values come from independent tests). The final similarity score of the test gene is then obtained as 1 minus the combined p-value. Additional details explaining the development of this method along with the validation process and comparison with other approaches have been previously published [9, 10].
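The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the ToppGene implementation: the function names are hypothetical, the `floor` guard against log(0) is our addition, and the chi-square tail probability uses the closed form that exists for even degrees of freedom, so no external library is needed.

```python
import math


def empirical_p_value(test_score, random_scores):
    """Fraction of randomly sampled, annotated genes scoring higher than the test gene."""
    if not random_scores:
        raise ValueError("random sample must contain at least one annotated gene")
    higher = sum(1 for s in random_scores if s > test_score)
    return higher / len(random_scores)


def chi2_sf_even_df(x, n):
    """P(X > x) for a chi-square variable with 2n degrees of freedom.

    Closed form for even degrees of freedom:
    exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    """
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values, floor=1e-300):
    """Combine independent p-values with Fisher's inverse chi-square method.

    floor is an assumed guard so that a p-value of 0 does not hit log(0).
    """
    statistic = -2.0 * sum(math.log(max(p, floor)) for p in p_values)
    return chi2_sf_even_df(statistic, len(p_values))


def overall_similarity_score(p_values):
    """Final score of a test gene: 1 minus the combined p-value."""
    return 1.0 - fisher_combined_p(p_values)
```

For example, two annotations that each yield p = 0.05 combine to roughly 0.0175, giving an overall similarity score of about 0.98. Fisher's method assumes independent tests, so correlated annotation sources will make the combined p-value look more significant than it should.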

4.3.2 ToppNet: Network Analysis-Based Candidate Gene Prioritization

ToppNet gene prioritization is based on the analysis of the protein–protein interaction network. Motivated by the observation that biological networks share many properties with social and Web networks [28], ToppNet uses extended versions of three algorithms from White and Smyth [72]: PageRank with Priors (PRP), HITS with Priors, and K-Step Markov. The disease candidate genes (test set) are ranked by estimating their relative importance in the PPIN to known disease-related genes (training set). PageRank with Priors, based on White and Smyth’s PageRank algorithm [72], mimics the random surfer model, wherein a random Internet surfer starts from one of a set of root nodes, R, and follows one of the links randomly in each step. In this process, the surfer jumps back to the root nodes with probability β, thus restarting the whole process. Intuitively, the PRP algorithm generates a score that is proportional to the probability of reaching any node in the Web surfing process. This score measures the relative “closeness” or importance of a node to the root nodes. The second algorithm is HITS with Priors, an extension of HITS (Hyperlink-Induced Topic Search), developed by Jon Kleinberg to rank Web pages. HITS determines two values for a page: “hubness,” representing the value of its links to other pages, and “authority,” which estimates the value of the content of the page [33]. Here, too, the surfer starts from one of the root nodes. In the odd steps he/she can either follow a random “out-link” or jump back to a root node, and in the even steps he/she can instead follow an “in-link” or jump back to a root node. Like PRP, HITS with Priors estimates the relative probability of reaching a node in the network. The third algorithm is the K-Step Markov method, which mimics a surfer who starts from one of the root nodes, follows a random link in each step, and after K steps returns to a root node and restarts surfing. For additional details, readers are referred to our original published study [8].
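The random-surfer description of PageRank with Priors can be illustrated with a short power-iteration sketch. This is a simplified reading of the text, not the ToppNet code: the default β of 0.3, the convergence tolerance, the treatment of nodes without partners, and the toy gene names are all assumptions made for the example.

```python
def pagerank_with_priors(adjacency, roots, beta=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its probability of being visited by a surfer who
    restarts at the root nodes with probability beta.

    adjacency maps each node to the list of nodes it links to; every linked
    node must itself be a key. roots is the set of training (seed) nodes.
    """
    nodes = list(adjacency)
    # Prior: restart mass is spread uniformly over the root nodes.
    prior = {v: (1.0 / len(roots) if v in roots else 0.0) for v in nodes}
    score = dict(prior)
    for _ in range(max_iter):
        new = {v: beta * prior[v] for v in nodes}
        for u in nodes:
            partners = adjacency[u]
            if not partners:
                # Assumption: a node with no links sends its mass back to the roots.
                for v in nodes:
                    new[v] += (1 - beta) * score[u] * prior[v]
                continue
            share = (1 - beta) * score[u] / len(partners)
            for v in partners:
                new[v] += share
        change = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if change < tol:
            break
    return score


# Hypothetical chain G1 - G2 - G3 - G4 with G1 as the only known disease gene.
chain = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2", "G4"], "G4": ["G3"]}
scores = pagerank_with_priors(chain, roots={"G1"})
```

On this chain the scores fall off with distance from G1, which is the behavior the text describes as relative closeness to the root nodes. Every iteration touches all nodes and edges, which illustrates why the method becomes costly on large interactomes.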

4.3.3 ToppGeNet: Prioritization of Disease Gene Neighborhood in the Protein Interactome

ToppGeNet allows the user to rank the interacting partners (direct or indirect) of known disease genes for their likelihood of causing a disease. Here, given a training set of known disease genes, the test set is generated by mining the protein interactome and compiling the genes interacting either directly or indirectly (based on user input) with the training set genes. The test set genes can then be ranked using either ToppGene (functional annotation-based method) or ToppNet (PPIN-based method).
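The test-set construction described above amounts to collecting the network neighborhood of the training genes. A minimal sketch, assuming the interactome is available as a dictionary of partner sets and that "indirect" means partners reachable within a user-chosen number of steps (the function name and the `max_distance` parameter are hypothetical):

```python
from collections import deque


def neighborhood_test_set(interactome, training_genes, max_distance=1):
    """Collect genes within max_distance interaction steps of any training gene.

    max_distance=1 returns direct partners only; larger values add indirect
    partners. Training genes themselves are excluded from the result.
    """
    distance = {g: 0 for g in training_genes if g in interactome}
    queue = deque(distance)
    while queue:
        gene = queue.popleft()
        if distance[gene] == max_distance:
            continue  # do not expand beyond the requested radius
        for partner in interactome.get(gene, ()):
            if partner not in distance:
                distance[partner] = distance[gene] + 1
                queue.append(partner)
    return {g for g, d in distance.items() if d > 0}


# Hypothetical interactome: T1 is a known disease gene.
toy_interactome = {"T1": {"A"}, "A": {"T1", "B"}, "B": {"A", "C"}, "C": {"B"}}
direct_partners = neighborhood_test_set(toy_interactome, ["T1"], max_distance=1)
with_indirect = neighborhood_test_set(toy_interactome, ["T1"], max_distance=2)
```

The resulting set can then be handed to either ranking method. In a real interactome the set grows quickly with distance, particularly around hub proteins.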

4.4 Case Studies to Demonstrate the Utility of Computational Approaches for Human Disease Gene Prediction and Ranking

In the following sections we present two sets of case studies to demonstrate the utility of computational approaches in discovering and ranking novel candidate genes for human diseases. In an earlier study, Tiffin et al. [61] used some of the computational approaches for disease gene identification and prioritization and concluded that using the methods in concert was more successful in prioritizing candidate genes for disease than using each alone. Hence, in the first case study, we select ten diseases and use both functional annotation-based and network-based approaches to identify and rank novel candidate genes for these diseases. We used ToppGene [9] for functional annotation-based ranking, and for network-based ranking we used both parameter-based [8] and parameter-free [75] approaches (see next section for details). In the second case study, we present two recent examples that demonstrate the power of combining bioinformatics techniques with exome sequencing technologies to identify novel candidate genes for rare disorders.

4.4.1 Case Study 1: Identifying and Ranking Novel Candidate Genes for Ten Human Diseases

The workflow (Fig. 4.1) described here simulates a researcher’s approach to selecting and ranking candidate disease genes. In this process, a variety of relevant database sources are mined to compile both the training and test set genes. Known disease-associated genes for the ten selected diseases (from a recent review [43]) were obtained by combining gene lists from OMIM [21], the Genetic Association Database [4], GWAS [22], and disease biomarkers from the Comparative Toxicogenomics Database [13] (see Table 4.2 for the list of the ten selected diseases and their training sets or known causal genes). The test set, or candidate genes to be ranked, is compiled by mining the protein interactome and functional linkage networks. Briefly, for each of the training set genes (known disease causal genes), we extracted its interacting partners (both from the protein interactome and from functional networks). The protein interactome data was downloaded from the NCBI (ftp://ftp.ncbi.nih.gov/gene/GeneRIF/interactions.gz), while for functional networks, we used two sources: (a) Functional Linkage Network (FLN) [38] and (b) STRING (score ≥ 700) [58]. Thus, for each disease, we compiled three test sets using the three databases.

Fig. 4.1 Panel (a) shows a schematic representation of the workflow for identifying and ranking novel disease candidate genes using functional annotation- and network-based approaches. Candidate genes are compiled using both protein interactions and functional associations (Functional Linkage Network and STRING). The candidate genes are ranked using both functional annotations (ToppGene) and network topology (PageRank with Priors and Vertex Similarity-based approaches). The final ranks are generated by taking the harmonic mean of the ranks of a gene from the three methods (ToppGene, PRP, and VS). Panel (b) shows the top-ranked genes for congenital diaphragmatic hernia using functional annotation- and network-based approaches. Highlighted genes (LRAT, ZFPM2, NKX2-5, and PDGFRB) are those ranked among the top ten by different approaches.

Table 4.2 Top-ranked novel candidate genes for ten select diseases

The test sets were then ranked by three approaches: (a) functional annotation-based ranking (using ToppGene), (b) PageRank with Priors (parameter-dependent network topology-based approach), and (c) Vertex Similarity (parameter-free network topology-based approach). We used the harmonic mean of the individual ranks from the three approaches to obtain the final ranked list. We repeated the same process for the two other test sets obtained from functional networks (FLN and STRING). In the final step, we intersected the top ten genes from the three networks (PPIN, FLN, and STRING). The last column in Table 4.2 shows the genes that are ranked among the top ten in all three networks. For example, in congenital diaphragmatic hernia (CDH), four genes (LRAT, ZFPM2, NKX2-5, and PDGFRB) were ranked among the top ten in all three networks. Interestingly, the retinol status in newborns is associated with CDH, and genetic analyses in humans suggest a role for retinoid-related genes in the pathogenesis of CDH [6]. LRAT (lecithin retinol acyltransferase), one of the top-ranked genes, mediates cellular uptake of retinol and plays an important regulatory role in cellular vitamin A homeostasis [31]. Similarly, Wat et al. [71] identified three unrelated patients with CDH who had a heterozygous deletion of chromosome 8q involving ZFPM2, which was ranked among the top five in the three networks. It is beyond the scope of this chapter to discuss the top-ranked genes for all ten diseases. The supplementary file (Supplementary File 1) shows the complete lists of training and ranked test set genes for the ten selected diseases along with the details of the rankings from each of the three approaches using the three different networks (PPIN, FLN, and STRING).
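The rank aggregation and the final intersection step can be sketched as follows. This is an illustrative reconstruction from the description above, not the code used in the study; the function names, the restriction to genes ranked by every method, and the alphabetical tie-break are assumptions.

```python
def harmonic_mean_rank(ranks):
    """Harmonic mean of a gene's ranks (1 = best) across several methods."""
    return len(ranks) / sum(1.0 / r for r in ranks)


def combine_rankings(rankings):
    """Merge per-method rankings (dicts of gene -> rank) into one ordered list.

    Only genes ranked by every method are kept (an assumption); ties are
    broken alphabetically.
    """
    shared = set.intersection(*(set(r) for r in rankings))
    combined = {g: harmonic_mean_rank([r[g] for r in rankings]) for g in shared}
    return sorted(combined, key=lambda g: (combined[g], g))


def top_k_consensus(final_lists, k=10):
    """Genes that appear in the top k of every network's final ranked list."""
    return set.intersection(*(set(lst[:k]) for lst in final_lists))


# Hypothetical ranks of three genes from three methods (e.g., ToppGene, PRP, VS).
method_ranks = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
final_order = combine_rankings(method_ranks)
consensus = top_k_consensus([["A", "B", "C"], ["B", "A", "D"], ["A", "E", "B"]], k=2)
```

The harmonic mean is dominated by a gene's best ranks, so a gene placed first by one method stays near the top even if another method ranks it poorly. The intersection step is stricter and keeps only genes supported by all three networks.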

4.4.2 Case Study 2: Exome Sequencing and Bioinformatics Applications to Identify Novel Rare Disease Causal Variants

In the following sections we present two examples from recently published studies [5, 14] where computational approaches for candidate gene ranking were used in concert with exome sequencing to identify novel disease causal variants.

The first example [14] illustrates the potential of combining genomic variant and gene-level information to identify and rank novel causal variants of rare diseases. Combining computational gene prediction tools with traditional mapping approaches, Erlich et al. [14] demonstrated how rare disease candidate genes from an exome resequencing experiment can be successfully prioritized. In this study, a familial case of hereditary spastic paraparesis (HSP) was analyzed through whole-exome sequencing, and the four largest homozygous regions (containing 44 genes) were identified as potential HSP loci. The authors then applied several filters to narrow down the list further. For instance, a gene was considered potentially causative if it contained at least one variant that was under purifying selection, not inherited from the parents, or absent from dbSNP or the 1000 Genomes Project data. Because the majority of known rare disease variants affect coding sequences, the authors also checked whether the variant was non-synonymous. After this filtering step, 15 candidate genes were identified, and this list was further prioritized using three computational methods (Endeavour [3], ToppGene [9], and Suspects [2]). As a training set, a list of 11 seed genes associated with a pure type of HSP was compiled through literature mining. Interestingly, the top-ranking gene from all three bioinformatics approaches (each of which uses different types of data and algorithms for prioritization) was KIF1A. The KIF1A variant was subsequently confirmed as causative using Sanger sequencing.

In the second example, Benitez et al. [5] used a disease-network analysis approach as supporting in silico evidence for the role of the adult neuronal ceroid lipofuscinosis (NCL) candidate genes identified by exome sequencing. In this case, the authors used Endeavour [3] and ToppGene [9] to rank the NCL candidate variant genes identified by exome sequencing. Known causal genes of other NCLs along with genes associated with phenotypically close disorders were used as the training set. Interestingly, the three genes with variants identified by exome sequencing (PDCD6IP, DNAJC5, and LIPJ) were among the top five genes in the combined analysis using ToppGene and Endeavour, suggesting that they may be functionally or structurally related to known NCL genes and may harbor true causative variants for adult NCL.

4.5 Final Remarks

Selecting the “best” computational approach for identifying and ranking disease candidate genes is not an easy task and depends on several factors. First, since a majority of these approaches are based on the guilt-by-association principle, having a “good” or representative training set is critical. The training set need not always be a set of known causal genes; it can also be an implicated pathway or biological process or even a list of symptoms (or phenotypes). Additionally, prior knowledge can sometimes be inferred from related or similar diseases, where the similarity may lie in the manifestations or symptoms or in the underlying molecular mechanisms. Second, selecting an appropriate approach is also important and frequently depends on the disease type and the molecular mechanism that causes it. For example, using protein–protein interaction data to identify novel candidates may be useful when a disease is known to be caused by the disruption of a larger protein complex. On the other hand, using a protein interaction network may not be fully justified for a disease known to be caused by aberrant regulatory mechanisms. In such cases, gene regulatory networks and/or high-throughput gene expression data may be more apt [50]. Third, since several previous studies have shown that the computational approaches for disease gene ranking are largely complementary [5, 14, 44, 61], we recommend using a combination of at least two different approaches (e.g., functional annotation-based and network topology-based approaches).