1 Introduction

Network biology has emerged as one of the most promising angles of systems biology and systems medicine [1–4]. As the realization of any biological function or process depends on the interactions of small molecules and macromolecules rather than on individual proteins, RNAs, or metabolites, it is almost undisputed that biomedical researchers need to work on the interactions of cellular constituents and the resulting networks composed of nodes and interactions. Researchers in network biology, as the name suggests, mainly deal with various types of networks in biology. The major conceptual contribution of network biology to the entire biomedical community is to help and guide basic, translational, and clinical researchers to investigate systematically the functions of genes/proteins or metabolites based on interlinked molecular networks rather than isolated single pathways or even individual genes. With the ability to gain a holistic understanding of cellular activities, network biologists have a unique power in characterizing and predicting robustness, flexibility, redundancy, and many other evolutionarily conserved systems features in various kinds of uni- and multicellular organisms. There are numerous types of intra- and intercellular networks. For example, in the classic metabolic networks, where metabolites react with each other to consume and produce metabolites, the nodes represent metabolites and the links between nodes designate metabolic reactions between the linked metabolites. Another important type of molecular network is the gene regulatory network (GRN), in which one node corresponds to one gene and an arrow from one gene to another indicates a transcriptional regulatory relationship. In contrast to these directed networks, researchers have also started to reveal undirected networks such as protein–protein interaction (PPI) networks, in which one node represents one protein and a link between two nodes indicates a physical interaction between the two proteins.
In real life, not to mention intercellular networks, all the various network layers are crosslinked with each other, introducing further complexity in understanding the molecular mechanism underlying any biological process. To handle better the obstacles to understanding the complex cellular mechanisms driven by dynamic multiscale interlocked networks [5, 6], network biologists visualize, construct, and analyze various molecular networks from different entry points.

Currently, the major trend in network biology is to infer and reconstruct various types of molecular interaction networks, for example metabolic networks, GRNs, signaling transduction networks, PPI networks, and others. In particular, with the development of large-scale ‘omics’ measurement techniques, many types of genome-scale data from different tissues, organs, and cell types, with and without perturbations, are increasingly available, including genome, transcriptome, proteome, metabolome, and miRNA data. This availability further drives the fast advances in computational approaches to network inference and reconstruction. For technical details of network reconstruction and inference, also known as data-driven reverse engineering, the reader should refer to other reviews [7–10].

Once a given molecular network has been reconstructed or inferred, we cannot simply stop at being impressed by the visualized large, complex network. Although it is apparent that the currently available large-scale datasets, knowledge-based interaction databases, and computational approaches do not yet allow us to obtain a complete and fully accurate molecular network for a given organism or cell, this should not stop us from going forward. The next objective is to make use of these complex molecular networks even though they are still incomplete and far from fully accurate.

Until now, many efforts have been made mainly in analyzing metabolic networks, which are relatively more accurate and complete than other types of molecular interaction networks. Among many applications, it has been well established that in a metabolic network one can calculate all the potential elementary pathways or extreme pathways between the starting metabolites and final products [11–13]. Such purely stoichiometric and topological property-based approaches are extremely valuable for investigating the redundancy of molecular networks. However, all physiological and pathological processes take place under certain resource constraints, which should narrow down the solution space. Based on the reconstructed metabolic networks and given constraints, pioneers have already been able to predict the effects of gene deletions on cell viability of Saccharomyces cerevisiae with high accuracy using flux balance analysis [14]. More practically, the metabolic network has recently been used to predict drug targets or combinations of targets against human cancer cells by identifying genes essential for cell proliferation or synthetic lethal genes [15]. Predicting essential genes for cell proliferation or survival based on metabolic networks is in fact the first successful application of network biology. Possibly because of the nature and essential function of metabolic networks, namely maintaining cell survival and proliferation, one can hardly use them to predict genes critical to functions other than viability and proliferation. Many more functions can be executed even by bacterial cells, not to mention mammalian cells. Fortunately, many more layers of molecular networks exist to complement the narrow functional scope of metabolic networks. The extra complexity, however, also makes it more difficult to predict key genes for a given cellular function or process.

It is important to note the essential difference between key gene prediction for a given biological process or function and that for a specific disease. The latter has been well reviewed [16, 17] whereas the former has yet to be summarized. Although dysfunction of a disease key gene often causes catastrophic consequences for human health, expression or functional changes of key genes in the former case might only alter the activity of a particular biological pathway, which does not necessarily lead to a severe event at the whole-cell or organism level. Identifying genes that play a quantitative regulatory role, whose effects might be altered or compensated by redundant pathways in molecular networks, is even more challenging than identifying the death-or-survival essential nodes of the networks. Furthermore, the readout of activity changes in many cellular processes or functions is often less obvious than disastrous disease symptoms and is not easily examined by current techniques.

In this review we mainly summarize the recent advances in the key gene discovery for a given physiological or pathological process based on GRNs, PPI, and expression correlation networks. We also survey several selected examples in the context of immunology and infection. Through comparisons we support more implementation of expression correlation network-based key gene predictions that are still non-mainstream. Furthermore, we present our views on potential challenges and future directions.

2 GRNs-Based Key Gene Discovery

2.1 Hub-Based Prediction

The long-term dream of biologists is to understand fully how one or multiple transcription factors dynamically and quantitatively regulate their target genes [18–20], the regulatory cascades composed of chains or trees of linked interactions, and eventually the overall GRN in a given cell type or organism. In the last few decades, enormous progress has been made in understanding the regulatory relationships between individual regulators and some of their target genes. However, accumulating results on individual regulatory interactions from finer-scale or large-scale high-throughput experiments cannot automatically help us to understand how GRNs control cellular functions and processes. To understand these regulatory processes systematically, the first step is to assemble and reconstruct the GRNs from various publicly available or commercial interaction databases. In parallel, although significant progress has already been made, computational techniques need to be developed further to infer GRNs from a single type or multiple types of genome-scale data such as ChIP-chip, ChIP-seq, RNA-seq/microarray, proteome, metabolome, and others, with or without time-series and knockdown/knockout measurements [8, 21–23]. Following the reconstruction and inference of GRNs, researchers aim to predict whether there are any novel key regulators for the biological process or disease of interest. Biomedical researchers are of course also highly interested in identifying novel drug targets for therapy of the disease of interest by analyzing GRNs. The most easily accepted and straightforward approach is probably to identify the most highly interlinked regulators, also known as hubs [24]. Intuitively, the more target genes a given regulator has in the GRN, the more important that regulator is in controlling or ‘influencing’ the given network or subnetwork, which suggests an important role for hubs in the given cellular activities or diseases.
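The hub idea can be illustrated with a minimal sketch that ranks regulators by out-degree, that is, by their number of distinct target genes. The function name `rank_hubs` and the toy edge list are hypothetical and serve only as illustration; they are not taken from any of the cited studies, which apply additional filtering before ranking.

```python
from collections import defaultdict


def rank_hubs(edges):
    """Rank regulators by out-degree (number of distinct target genes)."""
    targets = defaultdict(set)
    for regulator, target in edges:
        targets[regulator].add(target)
    # Sort by descending out-degree; ties are broken alphabetically
    # so that the ranking is deterministic.
    return sorted(
        ((reg, len(tgts)) for reg, tgts in targets.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical toy GRN given as (regulator, target) pairs; not real interactions.
toy_grn = [
    ("R1", "g1"), ("R1", "g2"), ("R1", "g3"), ("R1", "g4"),
    ("R2", "g2"), ("R2", "g5"), ("R2", "g6"),
    ("R3", "g1"),
]
hub_ranking = rank_hubs(toy_grn)
print(hub_ranking)
```

In practice the ranking is usually applied to a context-specific subnetwork rather than to the whole inferred network, as the examples in this section show.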

For instance, by analyzing the hubs, Della Gatta et al. [25] identified a novel tumor suppressor in T cell acute lymphoblastic leukemia (T-ALL) from the inferred TLX interaction networks. The authors first predicted the potential interaction network from T-ALL patient samples using the widely-used ARACNe approach based on mutual information [26]. They further constrained such a huge general network down to a subnetwork by using genes that match with both criteria, not only direct targets of TLX1 and TLX3 based on ChIP-chip results, but also differentially expressed in TLX1- and TLX3-expressing T-ALL samples. Extraction of such a subnetwork greatly helped them to pinpoint the most highly linked regulator, RUNX1, the top ranked hub in that subnetwork, as a novel mediator in T-ALL. Furthermore, both human RUNX1 mutation analysis and murine Runx1 heterozygous knockout mice analysis have confirmed that RUNX1 is indeed a tumor suppressor in T-ALL.

It is also highly attractive for researchers to predict novel key regulators controlling the mode of action of a novel drug. We recently attempted to identify novel key mediators underlying the mode of action of a novel inhibitor, Carolacton, against biofilm formation by the leading human oral pathogen Streptococcus mutans [27]. We first generated a time-series transcriptome following treatment with Carolacton and predicted the time-shifted correlation network using a Trend Correlation (TC) approach [28]. We then combined the time-shifted correlation network with computed regulator-binding site maps and thereby obtained the Carolacton-response GRN. Following this, we identified CodY as the top-ranked hub. CodY is a well-known global master regulator, and its deletion in Streptococcus mutans significantly reduces biofilm formation [29]; it therefore needed no further validation. After CodY, CysR is the most highly linked regulator and was selected as the most promising novel candidate. Interestingly, our experimental results show that deletion of CysR reduced the sensitivity of Streptococcus mutans to Carolacton treatment, indicating a successful prediction of a novel key mediator underlying the mode of action of the inhibitor.

2.2 Biological Feature-Based Prediction

Rather than simply checking topological features, for example the top hubs of the inferred GRNs, some groups have examined a combination of several biological features of candidate regulators. For instance, Yosef et al. aimed to identify novel regulators controlling Th17 cell differentiation [30]. They generated a time-series transcriptome dataset and, in parallel, reconstructed a general network of regulator-target associations from various published genomics profiles. They then assigned a regulatory interaction between a regulator and a given gene for a certain time window if there was a significant overlap between the regulator’s putative targets and the cluster in which the given gene was located. Following the inference of the potential GRN controlling Th17 cell differentiation, they ranked all regulators by emphasizing certain biological features over others (e.g., they first ordered the regulators based on whether they were predicted regulators of key Th17 genes and only then based on whether they were differentially expressed over the time-series data). It makes sense to order in that manner because the activities of many transcription factors are mainly modulated at the posttranslational and/or posttranscriptional levels (also refer to the discussion below). With a joint effort from different groups, they successfully demonstrated that at least 4 of the 12 selected novel factors significantly mediate the Th17 program in knockout mouse models or knockdown cells.

2.3 Translated mRNA-Based Prediction

Researchers have recently moved beyond the classic transcriptome-based GRN analysis. As it is now well known that a large fraction of mRNAs might not necessarily be translated into proteins, Brichta et al. aimed to identify neurodegenerative factors using regulatory networks predicted from the so-called translatome, that is, all the translated mRNAs [31]. The authors first reverse engineered a general murine brain regulatory network from a mouse whole-brain transcriptome database. They then profiled the translatome by measuring the translated mRNAs specifically expressed in dopaminergic neurons (DNs) in a model of Parkinson’s disease in which DNs are progressively lost. Projecting the differentially expressed translatome onto the general murine brain GRN, the authors predicted 19 candidate regulators that drive the molecular signature mediating the degeneration of DNs at an early stage. Remarkably, none of the 19 is significantly differentially expressed at the mRNA level based on classic statistical analyses [32]. As substantia nigra pars compacta DNs are more vulnerable to degeneration than ventral tegmental area DNs, the authors further tested two predicted regulators that were expressed at higher levels in substantia nigra pars compacta DNs. Interestingly, virus-mediated knockdown of the two selected genes caused loss of substantia nigra pars compacta DNs, recapitulating the results obtained in a Parkinson’s disease murine model.

2.4 Multilayer Network-Based Prediction

Analysis based on a single network layer, for example the GRN, might be limited because the activities of many transcription factors are mainly tuned at the posttranscriptional and/or posttranslational levels [33, 34]. Therefore, we have recently extended the analysis to integrate multilayer molecular networks. We aimed to identify novel factors contributing to the survival of CD4+ cells following chronic infection with SIV [35]. To this end, we first assembled a general human molecular network by integrating GRN, PPI, and signaling transduction networks from various databases. We then projected the transcriptome measured from human CD4+ cells with chronic vs acute SIV infection onto the general integrated molecular network. We predict a regulator to be a key regulator only if its differentially expressed putative targets are significantly enriched and, at the same time, short chains of significantly affected molecular interactions exist upstream of the given non-differentially-expressed regulator. Consistent with the aforementioned discoveries by Brichta et al. [31], we predicted 13 ‘hidden’ non-differentially-expressed key regulators using our network approach, named Inter-Chain-Finder (ICF). Six of the 13 predicted key regulators are known to interact with HIV. These predicted ‘hidden’ key regulators are all involved in the regulation of cell growth, again underscoring the predictive power of the ICF approach. Such non-differentially-expressed ‘hidden’ key regulators cannot be detected or prioritized by other classic approaches, indicating the advantages of the ICF approach.

3 PPI Network-Based Key Gene Discovery

3.1 Centrality-Based Approaches

With the development and wide application of high throughput techniques to detect protein–protein interactions, for example yeast two-hybrid (Y2H) and co-immunoprecipitation-mass spectrometry approaches, we are moving toward generating comprehensive PPI networks for various organs, tissues, and cell types. The availability of increasingly comprehensive PPI networks provides us with another great opportunity to dig out novel key genes for a given biological function or process as most, if not all, biological functions can only be realized through physical interactions between various proteins and complexes. The question is which characteristics of the PPI network are unique or helpful to predict key genes. To address this question one first needs to find out which general characteristics the PPI networks possess for most organisms. In the early contribution of Barabasi and others, it was found that the PPI networks for all the studied organisms share a common characteristic, the scale-free property, namely most nodes have only a few links whereas a few other nodes, the so-called hubs, have a very high degree of connection [1, 36]. Similar to what researchers have done with the GRNs, the first proteins predicted to be disease-relevant are also the hubs or nodes as defined by other centrality measurements [37] as disease genes are more inclined to encode hubs in the PPI networks than non-disease genes [38]. However, this observation could be biased by the fact that researchers are more willing to investigate disease-related genes and consequently mainly identify PPIs for those disease-related genes, which is possibly further supported by all the leading funding bodies. Furthermore, analysis shows that it is mainly lethal-causative or essential disease genes that often cause spontaneous early severe diseases which are more likely to be the hubs than genes mediating other types of diseases [39]. 
Parallel ideas to distinguish the so-called date and party hubs, where the former appears more dynamic than the latter, might be helpful in identifying non-lethal disease genes [16, 40]. Last but not least, researchers are often interested in which genes/proteins are important for the given biological process or disease of particular interest rather than the general diseased conditions. Therefore, focusing on hubs of the PPI networks alone might not be the best choice to predict key genes for a given biological process. We need to develop better approaches or integrate more data information to predict or prioritize key genes for a given biological process or a given disease from the PPI networks. For instance, one could adapt the newly developed epidemiology analysis approach, the so-called expected force to quantify the spreading power of a protein, to evaluate the effects of the given proteins because the expected force outperforms other centrality measurement, especially for the non-hub nodes [41].

As discussed above, hub- or centrality-based analysis has weaknesses, as do all the other approaches. Nevertheless, Wu et al. have recently demonstrated the value of centrality analysis on a network extracted by integrating the general PPI network, a knockout transcriptome, and knowledge of a specific signaling transduction pathway [42]. From the general PPI network, the authors constructed a PPI network model connecting known proteins of the key Th17-stabilizing pathway, that is, the IL-23R signaling pathway, to the transcription factors that are dysregulated in murine Il23r−/− cells. They then ranked the network nodes by centrality analysis and identified SGK1 as the top-ranked node. Using sgk−/− mice, they validated that SGK1 plays a critical role in the induction of pathogenic Th17 cells. Apparently, the essential reason behind their success in applying centrality-based analysis lies not in the general PPI network but in the extracted specific PPI network.

3.2 Closeness-Based Approaches

In the last decade, enormous progress has been made in predicting disease genes from PPI networks using approaches based on the widely-accepted ‘guilt-by-association’ hypothesis [17, 43]. The assumption ‘guilt-by-association’ can easily be accepted because proteins having similar functions or sharing phenotypes have a high chance of interacting with each other [44] or become ‘close’ in terms of distance by certain measurements in the given PPI network. There are different ways to measure proximity of proteins in the given PPI network. These methods can be largely clustered into local and global distance measurements [17]. A well-established local distance measurement is the so-called direct neighbor counting (also known as first-degree neighbor counting) approach, which has at least two variants, the absolute count on the number of given disease genes linked to the candidate and the percentage of the number of given disease genes among all the genes directly linked to the candidate in the PPI network [4]. Initially, researchers employed such an approach to predict protein functions [45]. One of the first successful showcases to predict key genes was performed by Oti et al., in which they predicted disease genes by counting the number of proteins known to be linked to the given disease among the first-degree nodes of the candidate gene [46]. As not all the proteins in the same pathways directly interact with each other, the shortest pathway measurement has been introduced to calculate the closeness between the known disease proteins and the candidate proteins. For instance, as pioneers, Krauthammer et al. once used the shortest pathway approach from the constructed PPI network to predict candidate genes in Alzheimer’s disease [47]. Later on, Guney and Oliva improved the shortest pathway approach by integrating not only the pathway length but also the disease-associated nodes included in the pathway in the PPI network [48]. 
They showed a high predictive power in prioritizing top genes related to Alzheimer’s disease, diabetes and AIDS.
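The two local measures described above, direct neighbor counting (absolute count and fraction) and the shortest-path distance to known disease proteins, can be sketched as follows. This is a minimal illustration on a hypothetical toy network; the function names and node labels are assumptions made here, and the published methods [46–48] add scoring details that are not reproduced.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def neighbor_scores(graph, candidate, disease_genes):
    """Return (count, fraction) of known disease genes among direct neighbors."""
    neighbors = graph.get(candidate, set())
    count = len(neighbors & disease_genes)
    fraction = count / len(neighbors) if neighbors else 0.0
    return count, fraction


def shortest_distance(graph, source, targets):
    """Breadth-first search: hops from source to the nearest target node.

    Returns None if no target is reachable.
    """
    if source in targets:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            if nxt in targets:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


# Hypothetical toy PPI network; D1-D3 play the role of known disease proteins.
toy_ppi = build_graph([
    ("C", "D1"), ("C", "D2"), ("C", "X"),
    ("X", "Y"), ("Y", "D3"), ("Z", "Y"),
])
disease = {"D1", "D2", "D3"}

count, fraction = neighbor_scores(toy_ppi, "C", disease)
distance_z = shortest_distance(toy_ppi, "Z", disease)
print(count, round(fraction, 3), distance_z)
```

Candidate C is directly linked to two disease proteins, whereas Z has none among its direct neighbors and is only captured by the path-based measure.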

The global ‘closeness’ measurements such as random walk with restart and diffusion kernel have also been successfully applied to predict disease genes. For instance, Navlakha and Kingsford once demonstrated that random-walk approaches outperformed clustering and neighbor counting methods when predicting disease-related proteins from the constructed PPI network [49]. Interestingly, the comparison of different ‘closeness’ calculation approaches demonstrates that not only do different methods predict some unique novel key genes but also the prediction performance varies a lot among different diseases [49]. Therefore, as already proposed, a consensus approach combining various prediction methods into the ensemble method, the random forest classifier, needs to be used for a higher predictive performance. As it has been observed that the distance between known disease genes/proteins is inversely correlated with the predictive performance [49], the successful prediction of key genes is only favorable for some diseases in which the known disease proteins are already observed to interact closely with others. Such a variance among diseases might be biased by the incompleteness of the current human PPI networks and be backed by current research focusing only on certain diseases. Therefore, given the fact that the current human PPI network is still very incomplete, we need to incorporate further different types of data with the PPI networks, for example genome, transcriptome, proteome, post-protein modifications, and other layers of molecular interactions, and in parallel develop alternative better approaches to predict genes important for a given disease.
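As an illustration of a global ‘closeness’ measure, the following is a minimal sketch of random walk with restart computed by power iteration. The restart probability of 0.3, the convergence tolerance, and the toy path network are assumptions chosen for illustration only; they are not the settings used in [49].

```python
def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def random_walk_with_restart(graph, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    Assumes every node has at least one neighbor, which holds for graphs
    produced by build_graph.
    """
    nodes = sorted(graph)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: jump back to the seed nodes with probability `restart`.
        nxt = {n: restart * p0[n] for n in nodes}
        # Walk term: spread the remaining probability evenly over neighbors.
        for n in nodes:
            share = (1.0 - restart) * p[n] / len(graph[n])
            for m in graph[n]:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: a simple path S - A - B - C with S as the only seed.
toy_ppi = build_graph([("S", "A"), ("A", "B"), ("B", "C")])
scores = random_walk_with_restart(toy_ppi, seeds={"S"})
print({node: round(value, 4) for node, value in scores.items()})
```

Candidates are then ranked by their steady-state probability; in this toy path the score decreases with distance from the seed, but in a real network it also reflects the number of alternative routes to the known disease proteins.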

A challenge similar to the prediction of disease genes is to infer key genes for a given biological process or function from the PPI networks. As a starting point, one can also use the ‘closeness’ measurement approaches to predict novel key targets from the PPI networks. Compared with disease gene prediction, there are advantages and disadvantages in predicting genes important for a given biological process. On the one hand, much more information is available on the genes known to be involved in most biological processes, suggesting a higher predictive power when inferring key genes for most biological processes. On the other hand, genes important for many biological processes, except for essential cellular processes such as RNA processing and protein synthesis, might be even less likely to occupy hub nodes than non-lethal disease genes. It is understandable that an imbalance caused by dysfunction of some key genes in certain biological processes might be compensated by other pathways or functions. As a result, identifying non-hub but key genes for a given biological process might be even more challenging.

4 Expression Correlation Network-Guided Key Gene Discovery

We would already be able to predict or prioritize successfully some key genes for a given biological function or process, or even a specific disease, if the GRN and PPI networks were virtually complete and accurate for mammals, especially for human beings. However, the available GRN, PPI, and signaling transduction networks are far from complete, which severely affects the predictive power no matter which advanced computational approaches we utilize. Fortunately, the proposal to utilize expression correlation networks provides us with alternatives [28, 50, 51]. An expression correlation network (shortened to correlation network) is often purely data-driven, being fully based on large-scale transcriptome data and various ‘distance’ measurements, for example the most widely used Pearson correlation or its variant, the partial Pearson correlation [52]. If the correlation coefficient between two genes across various conditions or time points is higher than a given threshold, a link is assigned between the two genes. A systematic calculation for each pair of genes generates a correlation network [8]. It is not always possible to obtain an accurate and complete GRN, PPI, or signaling transduction network for many types of cells and organisms. Fortunately, with the decreasing cost of microarray measurements and RNA-seq, obtaining a correlation network is almost always possible for any type of cell or organism as long as genome-scale transcriptional profiling has been performed for a sufficient number of conditions or time points. Although undirected, the expression correlation network has its potential, as indicated by the early observation [44] that a high correlation between two human genes has equal or even slightly higher power to predict possible shared phenotypes than the interacting protein pairs from the highly curated Human Protein Reference Database (HPRD) [53].
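The construction described above can be sketched in a few lines: compute the Pearson correlation for every gene pair and keep the pairs whose coefficient exceeds a threshold. The threshold of 0.9 and the expression values are hypothetical; a variant that keeps strongly anti-correlated pairs as well would compare the absolute value of the coefficient instead.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sd_x * sd_y)


def correlation_network(expression, threshold=0.9):
    """Link every gene pair whose correlation exceeds the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r > threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles over five conditions or time points.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}
edges = correlation_network(toy_expression)
print(edges)
```

Only geneA and geneB are linked here; geneC is anti-correlated with both and therefore stays unconnected under this threshold.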

As stably interacting proteins tend to be coexpressed more often than random pairs [54–56], one could apply approaches similar to those used for PPI networks, for example ‘centrality’- and ‘closeness’-based methods. For instance, we have recently inferred a correlation network that exists only in CD4+ regulatory T cells (Tregs) but not in CD4+ effector T cells (Teffs) to identify novel key genes for the suppressive function of Tregs [57]. The correlation network was predicted from high-time-resolution time-series transcriptome data of either Tregs or Teffs following T cell receptor stimulation, using a combination of two different association calculation approaches that allow time shifts through dynamic programming [28, 58]. Before carrying out any deeper analysis, we first examined the connection degree of each node in the Treg-specific correlation network and ranked all the candidates based on this simple centrality measure. Excitingly, the top-ranked gene, STUB1 (an E3 ubiquitin ligase), was later independently shown to mediate the Treg suppressive function using stub1 knockout mouse models [59]. Although this gene was never emphasized in our original publication, the original table is available at our website (http://wwwen.uni.lu/lcsb/publications/databases_networks_tools). Other groups have also investigated correlation networks in cancer research. For instance, Yang et al. constructed four correlation networks, one for each of four cancer types [60]. Surprisingly, they found that prognostic genes are almost entirely depleted among the hub genes, indicating that it might not be appropriate to predict prognostic genes using hubs from correlation networks. This difference might arise not only from the different research subjects but also because we used time-series information rather than the static measurements they used. 
Incorporation of time-shift information into the prediction of correlation edges, which might not exist if calculation is solely based on synchronized correlation coefficients, in reality allows the potential candidates to interact with known key genes from different pathways. This in principle could increase the chances of identifying some novel key genes from different pathways as most genes in the same pathway might be coexpressed in a synchronizing way.

To utilize our Treg-specific correlation network better as a hypothesis generator, we used a naive Bayesian integration approach to incorporate various types of information, such as a closeness measure and expression difference, into a unified score for ranking all the candidate genes [57]. To prioritize the candidate genes (Fig. 1) we used a ‘closeness’ measure for the given gene in the correlation network, called the queen-bee-surrounding (QBS) principle, defined as the chance (calculated by cumulative binomial distribution tests) of obtaining the observed number of known potential T cell functioning genes among the total number of first-degree neighbors. We also calculated the expression difference of the candidate gene between Tregs and Teffs over the measured time period (Fig. 1). Remarkably, six of the top ten ranked key genes predicted from the Treg-specific correlation network are known to be important for the Treg suppressive function or to play a critical role in autoimmune diseases or disorders, encouraging us to validate further the top-ranked novel key genes. Interestingly, we were able to demonstrate that the novel key gene PLAU plays an important role in Treg suppressor function by using a specific antibody against PLAU in human Tregs and by using PLAU−/− murine Tregs [57]. One of the essential advantages of our QBS approach may lie in its ability to prioritize novel key genes, for example PLAU, that do not show a very strong expression difference between Tregs and Teffs and thus cannot easily stand out in classic differential-expression analysis. As demonstrated here, another advantage of correlation network-based approaches is the ability to predict key genes upstream of transcription factors, for example signaling proteins (here PLAU), which is almost impossible with GRN-based approaches (Table 1). 
As both centrality- and ‘closeness’-based approaches were able to predict some interesting novel key genes, as exemplified in our Treg correlation network, it might be worth checking whether combining both approaches into a consensus increases the predictive power. So far, in comparisons between correlation, GRN, and PPI network-based key gene prediction approaches, correlation network-based methods do not display obvious weaknesses, even if they do not show apparent advantages over the others (Table 1). Therefore, in some cases it is beneficial to explore correlation network-based key gene discovery, as the inference of a correlation network is much easier than that of the other network types.
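The QBS ‘closeness’ measure can be illustrated with a cumulative binomial tail: the probability of observing at least the given number of known genes among a candidate’s first-degree neighbors by chance. This is a minimal sketch, not the implementation used in [57]; in particular, the background rate (the assumed fraction of known genes among all network nodes), the function names, and the gene labels are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def qbs_pvalue(neighbors, known_genes, background_rate):
    """Return (k, n, p-value) for a candidate's first-degree neighborhood.

    k is the number of known genes among the n neighbors; a smaller p-value
    means the candidate is surrounded by more known genes than expected.
    """
    n = len(neighbors)
    k = len(neighbors & known_genes)
    return k, n, binomial_tail(k, n, background_rate)


# Hypothetical first-degree neighborhoods in a correlation network.
toy_neighbors = {
    "candidate": {"K1", "K2", "K3", "U1"},
    "other": {"K1", "U2", "U3", "U4"},
}
known = {"K1", "K2", "K3"}
background = 0.1  # assumed fraction of known genes among all network nodes

qbs = {
    gene: qbs_pvalue(neigh, known, background)
    for gene, neigh in toy_neighbors.items()
}
print(qbs)
```

Ranking candidates by ascending p-value places the gene surrounded by three known genes ahead of the gene with only one, independently of how strongly either gene is differentially expressed.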

Fig. 1 Integrated pipeline abstracted from the original work [57] to predict key genes for a given cellular process in a specific cell type based on correlation networks inferred from ‘omics’ data. Red, representing computation-orientated steps; green, designating experiment-related steps

Table 1 Comparisons between correlation, GRN and PPI network-based key gene discovery approaches

5 Concluding Remarks and Future Directions

The exponentially increasing amount of genomic and functional genomic data, such as coding and noncoding transcriptome (e.g., from microarray, RNA-seq, and microRNA assays), proteome, metabolome, ChIP-seq, and other data, empowers us to unveil, better than ever, the cellular components, their interactions, and the resulting molecular networks. The next essential task is to interpret and make proper use of those large-scale datasets. To this end, many excellent computational approaches, some of which we have briefly discussed here, have been developed to analyze them. Some of them allow us to infer or reconstruct molecular networks from large-scale ‘omics’ datasets. Following network reconstruction or inference, some reports have already successfully demonstrated the application value of various types of molecular networks. One of the essential but challenging applications for systems biologists is to predict key genes for a given biological process or, eventually, for a specific disease. At first sight, predicting key genes seems quite paradoxical because the way we predict key genes almost reverses the route by which we infer networks. Network inference starts from a huge number of potential genes without knowing whether and how these genes interact with or regulate each other. Key gene prediction, in contrast, seeks to pinpoint a few genes or nodes within complex networks comprising a huge number of nodes and interactions. As discussed above, several network analysis approaches allow us to predict novel key genes that do not readily stand out, or are not even detected, in classic differential-expression-based statistical analyses. However, we are still a long way from being able to apply network analysis to successfully predict novel drug targets or novel prognostic or diagnostic markers for a given disease.
Possibly because of the extremely long period required for drug development, including all the phases of clinical trials (at least 10 years), to date there has been no successful report on network-based novel drug target identification and subsequent drug development.

To predict novel key genes for a given cellular process more successfully from the inferred or reconstructed molecular networks, we believe the following directions should be further developed:

  1. First of all, context- and cell-type-specific network inference needs to be developed further by integrating condition- and cell-type-specific data and various layers of ‘omics’ data into the general molecular networks. The context- and cell-type-specific networks should be highly relevant to human disease gene prediction, as particular human diseases are mainly driven by specific cell types, or at least by specific tissues [61]. For instance, it is only relevant to utilize CD4+ T cell-specific networks, rather than fibroblast cellular networks, to infer key genes important for CD4+ T cell differentiation and function.

  2. We need to infer or reconstruct personalized, or at least patient-subgroup-related, molecular networks, as many patients with the same disease can be further stratified into several subgroups based on their molecular patterns [62, 63]. For instance, responders to the promising tumor immunotherapy checkpoint inhibitors apparently have very different molecular patterns from those of non-responders [64, 65]. It only makes sense to predict novel key drug targets from molecular networks inferred from samples of non-responders rather than from a general tumor molecular network. Current efforts have already started to predict disease differential networks [66, 67], which might further increase the power to predict some disease key genes. A further step to finely divide the disease-specific networks or disease maps [68] into disease-subgroup networks is needed, as discussed above.

  3. A few examples of pioneering work [35, 69] have already integrated multiple-layer molecular networks to infer novel key nodes for a given biological process or specific disease. Nevertheless, many improvements are still needed and a few challenging questions remain to be addressed. For instance, what is the minimal amount of information required for successful key gene prediction? One might expect that the more data we use, the better the predictive power we can achieve. The answer is possibly not that simple, in that many different molecular layers of measured data might generate redundant information, unfortunately introducing more noise that hides the true signals. Another quite new and challenging problem is how to integrate various layers of molecular networks and multilayer ‘omics’ data to predict novel key targets, as in our body all these layers of networks are intertwined per se.

We are very confident that, with the further development of computational and experimental techniques for revealing cell-specific and patient-specific cellular networks, we can soon better predict and efficiently disclose novel key genes or drug targets mediating any given biological process, function, or particular disease by means of a combination of data-driven and knowledge-driven network approaches, rather than classic trial-and-error methods. The network-based strategy to discover key genes for a given cellular process should be the next driving force to revolutionize biomedical research, complementing the well-established GWAS or QTL-based approaches or the newly developed, resource-intensive systemic phenotyping of murine mutants which, although with clear caveats [70], have already substantially leveraged the systemic identification of key genes or loci associated with a given disease, phenotype, or symptom [71–76].