1 Introduction

The advancement in technology has uncovered various truths about life. Bioinformatics is one such field that uses advances in the areas of information technology, computer science and communication technology to handle the complicated problems of life sciences, more specifically in biotechnology. Thus, using computational techniques to solve such problems has fast-forwarded the process.

Human beings are made up of genes that are responsible for the functionality and characteristics of the individual. Human being acquires a total arrangement of qualities from each parent, and additionally a tremendous exhibit of social and financial encounters from his/her family. Since relatives speak to most firmly comparable genomic associations and share the same natural encounters along these lines’ family history is frequently observed as one of the most grounded elements for some diseases, for example, cardiovascular diseases, diabetes, cancer, and so on.

Uncovering these disease-causing genes among the candidate genes is thus a major step towards the pathogenesis of diseases. In research after some point in their analysis researchers filter which genes to consider for further experimentation and which one to drop. Disease gene prioritization aims at selecting these disease-causing genes among the large pool of candidate genes by computational examination of both open and private genomic information. Its principle objective is to maximize the yield by narrowing down the gene set to be considered and thus helping in focus on the most promising candidate genes. Thus, disease gene prioritization helps determine the functionality of genes, early diagnosis of disorders and designing effective strategies for treatment.

Disease gene prioritization provides various benefits such as:

  • It can help in the development of tools that will provide personalized information to the individual about the risk of developing a certain disease.

  • It helps health professionals to design prevention programs for each person according to their mark-up.

  • It allows effective selection of the treatments for the patients which are most likely to help them without causing any adverse reactions.

Identifying disease-causing genes and the functionality of various components of genes is a key challenge. Various high-throughput techniques such as genome wide association studies (GWAS) (Hirschhorn and Gadjos 2011), (Ott et al. 2011), linkage analysis, large-scale RNA interference screens (Boutros and Ahringer 2008) are being developed to distinguish disease-causing genes but these mostly resulted in a long list of candidate genes. Subsequently, these experimental methodologies are very time-consuming and expensive as they require several resources. For these various computational systems are being produced for prioritization of hopeful genes. These strategies work by performing estimations iteratively and some additionally coordinate complex, heterogeneous information sets, for example, sequence information, expression information, functional annotation, and other biomedical writing to give a rundown of organized gene list which would be useful for future review in a more educated manner.

In this review, different computational techniques have been discussed with the main attention on network-based disease gene prioritization. In the second section, various types of disease gene prioritization have been discussed. In the third section, various categories of network-based techniques are highlighted; with integration with heterogeneous data types discussed in the fourth section. In the fifth section of the paper, the workflow of the network-based disease gene prioritization process has been explained. In the sixth section discussion about techniques and problems that arise in the gene prioritization process are discussed.

2 Disease gene prioritization techniques

On the basis of strategy and type of data used, various computational techniques for candidate gene prioritization can be extensively classified into the under-mentioned categories: filtering-based techniques, similarity-based techniques, and network-based techniques. Table 1 lists some of the advantages and disadvantages of these techniques.

Table 1 Advantages and disadvantages of different types of gene prioritization techniques

2.1 Filtering-based techniques

Filtering-based techniques work on the basis of the properties of genes. The properties of genes are defined and then filters are created. With the application of each filter, the list of candidate genes gets reduced leaving at the end the smaller output list of candidate genes that have a higher probability of being linked with the disease. For example, a tool named ‘TEAM’ (Tool for the integration of Expression, and linkage and Association Maps) (Franke et al. 2004) have three-level filtering options for analysis of genes: candidate gene region analysis, whole chromosome analysis, and gene-specific analysis. Filtering-based techniques are also used in combination to other techniques for narrowing down the list of most promising candidate genes. GeneDistiller (Seelow et al. 2008) is a web-based application which offers a customizable prioritization of genes as per user requirements. The user can read the different kinds of information about the genes, select specific genes by applying some filters, sort the genes on basis of parameters and also prioritize genes as per required specification. For the prioritization task, the user can also assign different weights to different parameters considered while calculating the final score. But sometimes due to the strict filtering mechanism, filtering-based techniques sometimes also filter out the genes only if one criterion is not met thus producing false negatives.

2.2 Similarity-based techniques

Similarity-based techniques perform prioritization by calculating the similarity between candidate genes and seed genes. The similarity is identified either on the basis of disease-related keywords searched from disease-related documents or on the basis of similarity between properties of genes. Yu et al. (2008) have investigated various vocabularies and representations and used various ranking algorithms to perform prioritization. Perez-Iratxeta et al. (2002) developed a scoring system using data mining to prioritize genes using information from Medical Subject Headings (MeSH) (Lowe and Barnett 1994). The authors used information from MEDLINE and NCBI RefSeq database to derive relationships among functional terms, chemical terms and pathological conditions. A software named ENDEAVOUR (Aerts et al. 2006) performs gene prioritization on the basis of similarity with the training properties obtained from various data sources individually and then combining the results of all the ranks obtained from different sources into one list. Adie et al. (2006), developed a web-based server SUSPECT which uses the weighted combination of sequence-based and annotation-based approaches to prioritize genes. van Driel et al. (2006) used text mining to calculate disease similarity information by calculating the cosine of angle between the vector representation of each Online Mendelian Inheritance In Man (OMIM) record. In Schlicker et al. (2010) the authors introduced an approach MedSim, which uses functional comparisons to rank genes. The approach particularly automatically annotates GO terms with disease entries to create functional profiles of genes.

Over the time, various methods to calculate similarity between genes and diseases have been developed. Some these methods measure similarity on the basis the probability of the appearance of GO terms, their distances to the closest common ancestor term, the kappa statistics, cosine similarity, and many others. Pesquita et al. (2009), the authors have reviewed various methods used in biomedicine to calculate the similarity and provided the categorization, comparative assessment and various implications of the methods. Although the text mining-based approaches have performed well but they experience the ill effects of the issue of absence of steady representation or association of key ideas which often lead to redundant data. In addition, for the well characterized diseases, i.e., those diseases for which association of genes have already been identified in the past, there is more information retrieval about them using text mining techniques as compared to more newly identified diseases.

2.3 Network-based techniques

A network-based technique primarily uses the topology of the network for ranking the genes. In these techniques, the interactions among the proteins or genes in the organisms are represented using graphs, where hubs represent the entities and the edges represent the connections among them. This graphical representation helps in the visualization of the network which enhances the understanding of the complete biological structure. Various pieces of evidence propose that practically related qualities prompt to comparative phenotype (Oti and Brunner 2007). In addition, many studies have suggested that the genes responsible for the same or comparative diseases are generally located in one of the same biological modules such as in a pathway (Wood et al. 2007), protein complex (Lage et al. 2007), or a subnetwork of protein interactions (Lim et al. 2006). Thus, using these biological modules to understand the cause of the disease is quite helpful. The set-back to such techniques is the scarcity of knowledge of the disease which limits its progress.

3 Network-based disease gene prioritization

Based on their structure and properties, biological networks are divided into three types (Morris et al. 2015a): pathways, interaction networks, and similarity networks. While pathways are used to represent metabolic, signaling or regulatory pathways; interaction networks illustrate the interactions among biological products like genes, amino-acid residues, proteins, metabolites, small molecules, diseases, etc. for example gene regulation networks, metabolic networks, protein interaction networks, and signal transduction networks. In similarity type, nodes represent some biological entities, while edges represent the relation among nodes on the basis of some similarity measure forming networks like metabolic correlation networks using profiling data (Weckwerth et al. 2004), gene or transcript correlation network using expression data (Bergmann et al. 2004) or etc. Amongst the above discussed biological modules, it has been noticed that Protein–Protein Interaction Networks (PPINs) can possibly distinguish the disease-causing genes as these add to a better understanding of disease pathways (Safari-Alighiarloo et al. 2014).

A Potein–Protein Interaction Network (PPIN) is often abstracted as a graph model, where G (V, E) is an undirected graph with a set of nodes-V denoting proteins and set of edges-E denoting interactions among the proteins. Thus determining the properties of genes using the interactions, is what became the basic building component for the development of computational approaches for disease gene prioritization. These approaches work on the principle of “guilt-by-association” (Oti and Brunner 2007) according to which the genes that are functionally or physically related to disorders are more likely to be positioned closer to each other. Thus, on the basis of proximity to disease genes the likelihood of candidate genes being related to the disease can be computed.

Protein–protein interaction networks model physical interactions and functional relationships among the proteins. With the principle objective to identify protein interactions various large-scale experiments have also been conducted such as a high-throughput yeast two-hybrid (Y2H) system (Ito et al. 2000) which discover direct binary interactions among the genes, tandem affinity purification followed by mass spectrometry (Morris et al. 2014). In addition, various text mining applications that provide manually validated and/or computationally predicted PPIs from biological data are available for public use (Blaschke et al. 1999). Thus large-scale identification experiments lead to the generation of billions of interactions and these interactions have been managed in databases such as DIP (Xenarios et al. 2002), BIND (Bader et al. 2003), STRING (Szklarczyk et al. 2011), MINT (Licata et al. 2012), etc. This interaction information is further used to construct a PPI network. Since visualizing the interactions and network is a difficult task various bioinformatics tools such as GenMAPP (Dahlquist et al. 2002), Osprey (Breitkreutz et al. 2003), VisANT (Hu et al. 2005), CellDesigner (Funahashi et al. 2003), PIANA (Aragues et al. 2006), Cytoscape (Cline et al. 2007), Shannon et al. (2003) have been developed along with the various plug-ins. These tools have made the work of the researchers easier. Despite the incompleteness and biased nature of PPI data, PPI networks recently due to its high-coverage and high-quality are one of the most widely used biological networks in predicting and detecting protein complexes (Stelzl et al. 2005). Various topological parameters can be used to identify the collective, subnetwork and individual behavior of the network components. On the basis of topological parameters, the network-based prioritization techniques are broadly classified into local measures-based techniques and global measures-based techniques.

3.1 Local measures-based techniques

The early methods in gene prioritization focused on local network information of the network. Table 2 contains a list of some of the local measures which can be used for analyzing networks.

Table 2 List of local parameters

The most straightforward approach is based on the interconnectedness of the genes, i.e., whether the given genes are connected with disease genes directly in the PPI network. This can be determined either using a direct-neighborhood approach or shortest path approach. Krauthammer et al. (2004) proposed molecular triangulation as a method to identify genes responsible for genetic variations. The authors used text-mining approaches to extract interactions from research literature available online. Each node was initially assigned a primary-evidence score which was then projected to nearest neighbors. The evidence score of a node is the summation of near neighbors’ secondary-evidence score received from nearest neighbors. This evidence score was further used to identify the genes, as the gene which is nearer to more seed genes will have a higher score. Oti et al. (2006) used the direct-neighborhood based approach for predicting disease genes by counting the number of first-degree candidate genes known to be linked with disease genes.

But direct-neighborhood based approaches failed to take into account the fact that the disease genes are also the part of same pathways. Dezső et al. (2009) proposed an algorithm that builds a shortest-path network which connects condition-specific disease genes. The authors calculated the significance score of each node on the basis of the number of paths from the particular node in the shortest path network to the total paths in the complete network. Xi et al. (2016) compared the performance of three different human PPI networks from three different interaction sources that are literature-curated (LC), experimentally derived (EXP) and predicted PPIs. The authors used five topological features namely node degree, 1 N index, 2 N index, the average distance to known disease gene and positive topological coefficient to train a k-Nearest Neighbor (kNN) classifier to predict the genes with a higher likelihood of being involved with the disease. Hsu et al. (2011) considered two factors: one direct interaction between two genes and the second number of connectors between genes for calculating the interconnectedness score of the gene on the basis of which genes were prioritized. Zhu et al. (2012) formulated a vertex-similarity-based approach based on the similarity between two genes. The authors had used both direct neighborhood and shortest path approach to calculate the similarity score between two genes. Li et al. (2014) proposed a simple localized algorithm SPranker which performed prioritization on the basis of topological similarity calculated on the basis of the shortest path from the disease gene. Table 3 contains the list of summary of local measures-based techniques.

Table 3 Summary of local measures-based techniques

3.2 Global measures-based approaches

While the local measures consider only the neighborhood of the node, the global measures consider the complete topology of a network for ranking genes. Various techniques based on global measures have been proposed by researchers to perform disease gene prioritization some of which are discussed as follows.

3.2.1 Graph centrality

Wang et al. (2011) proposed to integrate different overlapping disease signatures on the basis of graph centrality. The authors used six graph centrality measures listed in Table 4 to quantify the importance of a gene in disease, which is along with their definitions on the basis of which prioritization was performed. Higher the value of the centrality measure more is the probability of a gene being related to the disease.

Table 4 List of centrality measures

3.2.2 Random walk with restart (RWR)

A random walk on graph is defined as a transition from the present node to a randomly chosen neighbor iteratively starting at a given source node, s. Random walk with restart is a variant of random walk in which the walk is restarted at node every s time step with probability r following equation:

$$p^{{\left( {t + 1} \right)}} = \left( {1 - r} \right).Wp^{\left( t \right)} + r.p^{\left( 0 \right)} ,$$

where \(W\) represents a column-normalized adjacency matrix and \(p^{\left( t \right)}\) is a vector, where the ith element of the vector represents the probability of being at node i at time step t and \(p^{\left( 0 \right)}\) is the initial probability vector.

Köhler et al. (2008) defined the similarity between the genes in the network using RWR. The gene ranking was done on the basis of the steady-state probability vector obtained after the change between the values of two iterations is very small (below 10−6). Erten et al. (2011) proposed a gene prioritization scheme on the basis of topological similarity in which the global signature of the location of gene is used to perform disease gene prioritization. The topological profile was constructed using the proximity of protein to every other protein in the network. The authors assessed topological similarity using random walk proximity as a feature. They proposed three different schemes on the basis of topological similarity: average topological similarity with seed genes (ATS); topological similarity with average profile of seed genes (TSA); topological similarity with the representative profile of seed genes (TSR).

Li and Li (2012) proposed two different algorithms based RWR using multigraphs. A multigraph is a graph constructed using different gene networks from multiple data sources and then merging them forming a one single network. The first algorithm RWRM (random walk with restart for multigraphs) simply extends random walk with restart on multigraphs. The second algorithm complex heterogeneous network works by combing multigraph gene network and phenotype information and then performing random walk. Le and Kwon (2013) tried to further enhance the performance of RWR based methods by weight adjustment technique in which weight reinforcement of interactions close to known disease genes is done.

3.2.3 Diffusion kernel

The diffusion kernel K of a graph is defined as

$$K = e^{ - \beta L} ,$$

where \(\beta\) is the magnitude of diffusion and L is Laplacian of the graph.

Köhler et al. (2008) used the method of diffusion kernel to rank genes on the basis of the score computed using the following formula:

$${\text{Score}}\left( j \right) = \mathop \sum \limits_{{i\,\epsilon\ {\text{disease}}\;{\text{gene}}\;{\text{family}}}} K_{ij} .$$

Zhang et al. (2011) proposed a Bayesian regression approach that uses diffusion kernels on PPI networks that prioritize genes on the basis of Bayes factor which calculates the strength of similarity between disease phenotypes using the gene proximity profile containing pairwise similarity between every pair of genes.

3.2.4 Network propagation

Vanunu et al. (2010) proposed a method, PRINCE which uses prioritization function following logistic regression model-based approach to assigning confidence scores to the genes. The method is similar to RWR with the major difference of using weight matrix which is derived from the adjacency matrix. The key idea was to normalize each edge weight by the degrees of both of its endpoints in contrast to the random walk with restart which considers only one endpoint.

3.2.5 Page rank with priors

Page Rank with Priors (PRP), another algorithm that is used for gene prioritization is an extension of Page Rank. In this algorithm, a prior probability is assigned to each node and further with some probability factor it chooses amongst its neighbors which node to move next. The iterative equation (Gonçalves et al. 2012) it uses is as follows:

$$p_{v}^{{\left( {i + 1} \right)}} = \beta .p_{v}^{\left( 0 \right)} + \left( {1 - \beta } \right).\mathop \sum \limits_{{\left\{ {u\varepsilon V,\left( {u,v} \right)\varepsilon E} \right\}}} p_{u}^{\left( i \right)} .\frac{{w\left( {u,v} \right)}}{d\left( u \right)},$$

where \(\beta\) is back probability denoting the probability of jumping to initial node at each step. Chen et al. (2009) used page rank with priors to rank genes based on the set of known disease genes. Sinsha and Chitturi (2016) highlighted the modified versions of the page rank algorithm to deal with the existing problems of gene prioritization such as bias nature towards neighbors of high degree nodes and central nodes.

3.2.6 Topology potential

Li et al. (2015) proposed a gene prioritization method by considering each protein as a material particle. Each particle around itself generates a potential field and all the interaction and thus results in the formation of a topological field over the entire network. The authors ranked the proteins on the basis of their topology potential value and selected the top x % proteins to identify essential genes as per the requirement. The authors also integrated other centrality measures with topology potential to enhance the performance of the system.

3.2.7 Machine learning

Machine learning has also been used in the field of disease gene prioritization. Lui et al. (2016) defined a Generalized Bi-relational Network (GBN) and used a series of common structure-based features named number of neighbor (NG), Jaccard coefficient, common neighbors (CN), neighbor of neighbors (NN), total neighbors (TN), Adar similarity(Adar), friends measure(FM), cosine similarity, preferential attachment score(PAS), Katz measure for classes of path weighting features. Common features, 4 classes of path weighting features and union of all features were used to compare of supervised-learning models: logistic regression, decision tree (DT), random forest (RF), and K-Nearest neighborhood (KNN). All path weighting features were also calculated using the unsupervised model. Lui et al. (2020) have applied the popular technique, i.e., machine learning for network-based classification. The authors have used supervised learning using the neighborhood of each gene’s network as its feature vector using the influence matrix, adjacency matrix, and node-embedding matrix.

3.2.8 Soft-computing techniques

Grewal et al. (2016) proposed to assign different weightage to different measures and thus used aggregation operators. Aggregation combines different values into a single value which has a combined effect of all the values. The authors extracted values of four different measures, assigned each a fuzzy association score and then combined them using aggregation operators. The aggregation score finally computed for each gene was used to form a prioritization list.

3.2.9 Community detection

Indulekha et al. (2018) presented an algorithm based on community detection in a PPI network to perform the task of gene prioritization. The main idea followed is the identification of community in a graph which is a collection of closely connected nodes. These detected communities are used in analyzing the properties of the network. The authors have used Gervan–Newman (2004) algorithm for the process of community detection which works using the edge betweenness parameter.

Table 5 contains a summary of global measures-based techniques.

Table 5 Summary of global measures-based techniques

4 Integration of heterogeneous data types for disease gene prioritization

The performance of a prioritization technique is heavily influenced by the data provided to the system, because if the seed gene set is too small, the detected pattern is insufficiently informative. With the increasing availability of data, the focus has shifted from using single omics data to integrating it with more heterogeneous data. This multi-omics approach allows better and deeper understanding of the complexity of phenotypes and also helps deal with biasness and missing nature of data. Using different types of biological information related to a disease enhances the knowledge about the disease and has reported to improve the performance of prioritization techniques. Thus, researchers proposed combining protein network information with different types of biological data.

There are mainly two approaches which are followed for the process of integration of data. The first approach involves consideration of data from different sources separately while accessing the genes and then finally aggregating the resultant multiple ranked lists into one final ranking list. The integration of lists is done using one of the many integration techniques like N-dimensional Order Statistic (NDOS) (Stuart et al. 2003), discounted rating system (DRS) (Li and Patra 2010), kernel based fusion (Yu et al. 2011), etc. The alternate approach is integrating the information from various data sources into a single network. A comprehensive network with all types of data is constructed in which different layer contains different type of data and are connected to each other through gene-phenotypic associations, and then the task of prioritization is performed. Various publicly accessible databases like OMIM (Amberger et al. 2019) provides phenotypic–genotypic relationships in human. In Zhang et al. (2019), the authors have constructed quadruple layer heterogeneous network to perform gene prioritization.

4.1 Gene ontology

The gene ontology gives controlled vocabularies of characterized terms speaking to quality gene properties. These cover three spaces: cellular component, molecular function, and biological process. It is organized as a coordinated non-cyclic chart, where every term has characterized connections to at least one different term in a similar space, and now and then to different areas. Li et al. (2014) also proposed another algorithm, i.e., SPGOranker, an improvement over SPranker which uses GO term similarity. The improved algorithm integrates the molecular functional similarity with the SPranker which works on the shortest distance method. The validation results obtained show the improvement in detecting true disease-causing genes with the integration of GO annotations.

Gentili et al. (2019) have proposed two heuristics using similarity-based on Gene Ontology-Biological Node Relevance (BNR) and Biological Random Walk (BRW). In the former technique, the authors have sorted genes on the basis of the score calculated as an intersection between biological information and annotations. BRW is a majorly RWR with the main difference that in BRW the matrices calculated depend on the biological information of the available genes.

4.2 Gene expression

Gene expression is an interpretation of data encoded in quality into protein or RNA structures that are available and working in the cell. Communicated qualities incorporate qualities that are interpreted into delivery person RNA (mRNA) and afterward converted into protein, and in addition, qualities that are deciphered into RNA, for example, exchange and ribosomal RNAs, yet not converted into protein. Proposed ESFSC, a method that simulating Random Walker for the identification of disease-related genes using topological properties of genes and gene expression correlation. The authors formulated the seed set based on known disease genes and their k-nearest neighbor, and initial probability scores were assigned to genes on the basis of the location of the gene. Furthermore, random walk with restart was implemented which was guided by similarity transition matrix and correlational transition matrix constructed on the basis of topological similarity properties and gene expression profiles, respectively. The final ranking of the genes was then done based on the steady-state probabilities obtained by combing results of random walk of two different matrices.

4.3 Orthology

Orthologs are homologous proteins that are gotten from a typical precursor. They generally have high comparable amino corrosive successions and hold the same or fundamentally the same as capacities. Peng et al. (2012) used the orthologous property of genes to perform gene prioritization. The authors proposed a method ION which combines three features: connections among the genes, the orthologous property of genes and features of neighbors. Initially, the orthologous score was being assigned to each gene and then the adjacency matrix was calculated based on the edge clustering coefficient. Then the rank of each gene is calculated on the basis of orthologous score and neighbor induced score.

4.4 Protein complexes

Yang et al. (2011) proposed a method RWPCN which perform prediction and prioritization of genes using human protein complexes and PPI network. The integrated network consisted of three layers with the phenotype network at top connecting phenotypes based on similarity scores, protein complex network connecting phenotypically-related protein complexes in the middle and PPI network connecting two interacting at the bottom. The authors have used both random walk algorithm and flow propagation for assigning scores to proteins which are further used for ranking them for prioritization. Jiang et al. (2014) proposed a random-set scoring model using protein complexes and text mining of biomedical literature for prioritizing the genes in a protein network. The authors’ collected gene-associated phenotype data from three different sources and the vocabulary filters defining phenotypic profiles from four sources and the cosine coefficient between the two represent the semantic similarity.

4.5 Tissue-specific information

Deng et al. (2016) proposed disease gene prioritization on a heterogeneous network using tissue-specific information. The heterogeneous network was built by the integration of tissue-specific PPI networks with phenotypic features. The authors implemented the proposed heterogeneous network approach on other existing prioritization approaches: Random Walk with Restart (RWR), PRINCE, Guo’s Method used individually.

Xi et al. (2016) prioritized atherosclerosis-related genes using text mining by performing the systematic analysis: pathway analysis, functional clustering analysis, and gene network analysis. The functional clustering analysis of genes that including gene ontology (GO) analysis identified over-represented genes. The pathway analysis identified the most highly enriched pathway and the gene network analysis prioritized genes following the hub method. Table 6 contains a summary of methods using the heterogeneous type of data.

Table 6 Summary of techniques using a heterogeneous type of data

4.6 Protein expression profile

Ren et al. (2019) have evaluated the performance of local ranking and global ranking methods by integrating protein expression profiles with the PPI network. Three different protein sets with prior disease knowledge are used. For local ranking methods, they ranked genes using the number of disease genes directly related to the gene and for global ranking, heat kernel method for restarting random walk on one of the three networks with prior disease knowledge is used.

Table 6 contains the summary of proposed network-based techniques using a heterogeneous type of data.

5 A basic workflow of network-based disease gene prioritization

A gene prioritization problem can be formulated as given a disease (D) of interest and a list of genes, the task is to rank these genes based on the likelihood of the genes to be involved with disease. The approach to the problem using the network-based technique is shown in Fig. 1 with details of each as follows:

Fig. 1
figure 1

Network-based disease gene prioritization approach

  • Collection of candidate genes: The quality of the prioritization process is greatly affected by the set of genes that needs to be prioritized. In addition, the size of the candidate set is a major factor, i.e., it can be a group of genes or the entire genome. There are various primary and secondary data sources such as microarray data, data from public repositories, data extracted from other biomedical literature available online which can be used for experimentation purposes.

  • Collection of disease-related information: With the advancing research, the disease-related knowledge is now available abundantly which can either be extracted from the biomedical literature online or from public databases that have been specially formulated to gather the information. This disease-related information is called the seed set(S).

  • Preprocessing of data: There are different ways in which data is represented such as a matrix of numbers, graphs, list of terms, etc. The way data is represented also influence the way the prioritization technique is being used. In addition, in case of data integration data from multiple sources needs to be merged efficiently.

  • Merging of information: Next the merging of both types of data, i.e., candidate data and disease-related information is done. This is an optional step and depends on the computational strategy being implemented for prioritization.

  • Implementation of computational strategy: After the formulation and preprocessing of the data set, the computational approach for prioritization is being implemented on it. These computational approaches are of various types and have been discussed in detail in the previous sections. Thus, on the basis of various features, the process of prioritization is performed which gives as output a list of genes ranked on the basis of the probability of these genes to be involved in the disease.

  • Assessment of prioritization: Proper assessment of the system is very important to analyze the effectiveness of the system such as benchmarking, cross-validation, using the negative dataset, functional enrichment, etc. Various performance measures such as the area under ROC curves, sensitivity, specificity, fold enrichment, etc. are used to validate and compare the performance of the system with other existing systems.

6 Discussion

Disease gene prioritization aims at identifying the genes which are most likely to be related to the disease. It is helpful for doctors to identify target genes and designing efficient treatment systems. It also is helpful in the prevention and early diagnosis of disease. In this review, various approaches to network-based disease gene prioritization based on PPI networks have been discussed. Although with advancing technology various approaches have been proposed still there are problems which need to be handled.

The biological data is the core of gene prioritization models and the performance of any prioritization model highly depends on the quality of data. One of the main problems with the data is the incompleteness of the data. Although there are many techniques available for identifying interactions among genes still there are missing links in the data. These missing links limit the performance of the system. Mean or median of observed values could be one possible way to interpolate missing data. In addition, the biological data is retrieved using parsing of available literature, so there is a possibility of errors due to falsely identified genes or missing links. This degradation in the quality of data could be improved by integrating data from various sources.

Data integration from different sources is assumed to deal with the problem of missing and falsely identified data. But acquiring and integrating data from various sources itself faces the problem of lack of consistent representation of data, as the same concept can be represented in different ways which sometimes leads to redundant data. Almost all approaches work on the basis of the prior knowledge available about the disease, but due to incompleteness of data, the results are biased towards the genes previously identified to have some relation to the disease. The approaches which have training set dependent on the disease are biased towards the well-studied set of genes.

So, one way to deal with these problems is to first integrate data from different data sources and then process that data by retrieving only the data which have significance in multiple data sources and leaving those with lesser significance. Another way is to use heterogeneous data related to genes such as ontology, annotations, pathways, etc. to improve the performance of the system. However, combining data from multiple resources is also a difficult and time-consuming task.

Another problem related to gene prioritization approaches is the regular updation of data sources. With the passage of time, the datasets also keep changing. So whenever the dataset is updated, the techniques which are parameter dependent needed to be returned to optimize the performance. Since all changes in a system cannot be automated it is difficult for the researchers to keep updating the data frequently.

Many researchers have proposed approaches based on the integration of heterogeneous types of biological data into a single network. Even though these approaches performed better as compared to others but integrating and organizing biological data into layers of networks is a difficult task as all elements are interrelated. In addition, biological information about various species is not available.

Different approaches use a different type of data and format. So, there is no standard data to compare the performance of different approaches. In addition, there is a need for improvement in benchmarking efforts to provide the global platform for the comparison of approaches.

7 Conclusions and future scope

In this review, the importance of the disease gene prioritization technique has been highlighted. We discussed various categories into which the technique has been categorized. The filtering-based techniques excludes the gene even if it does not match a single criterion without considering others. The similarity-based methods often fail to handle diseases about which no prior association information is known. Thus, major focus was to point out the progress made in disease gene prioritization techniques based on the PPI network. Gene prioritization techniques use various properties of proteins and genes to rank them on the basis of their locational proximity to the disease genes. An improvement over it is to use data from multiple resources which reduces the effect of missing and false data. Integration could be done in two ways either by first combing data from heterogeneous sources and then perform ranking or by individually ranking genes from different resources and then combine the results of ranked lists into one. But the performance of these topology-based methods is reasonable as the results are biased by the network topology. Thus, the performance of prioritization techniques could be enhanced by integrating different types of biological knowledge about genes with the network information. This integration attributes to further performance enhancement of techniques as knowledge integration covers rich information that could not be captured by network analysis. This integration of information can be made more productive with the help of functional analysis. Functional clustering analysis identifies the disease-causing genes based on the annotations which are significantly enriched. This method also identifies the annotations which are overly represented. Using enrichment analysis to combine different types of information can help researchers to use annotations that are more functionally enriched which might yield better results. Higher performance enhancement can be expected when more complete datasets are used.