Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Background

With the advent of new molecular profiling techniques, genome-wide datasets that describe interactions between molecular entities (i.e., mRNA, proteins, metabolites, etc.) are being generated at an ever increasing pace. These datasets each measure a specific type of interaction that is active under certain conditions within the cell or that occurs as a response to specific environmental signals. The distinct nature of these datasets often brings about complementary views on cellular behavior. A network-based representation of various biological systems captures many of the essential characteristics of these data and integrating complementary molecular interaction layers into a single network thus provides a way of representing our knowledge of the system as a whole. Application of well-established tools and concepts developed in fields such as graph theory on such networks can provide valuable insights into the systemʼs mode of action and functionalities [19.1]. The identification of motifs, which are statistically significant reoccurring characteristic patterns in a network [19.2], has for instance shown that specific types of motifs carry out specific information-processing functions within cells [19.3].

An integrated interaction network can also reveal previous unknown insights in biological processes or functional behavior by explicitly interrogating it with independent functional data sets. Methodologies that identify and explore paths in networks between given input and output nodes have gained much interest. Such a path in a network can be seen as a mechanistic representation of the way information propagates through the network. Identifying biologically meaningful paths in the network between nodes of interest, nodes which can be defined from functional data sets that are independent from the network itself, can unveil previously uncovered signal flow mechanisms that are responsible for the observed functional behavior or define a measure for relatedness of two nodes in the network.

In this chapter, we highlight diverse omics datasets and data-integration techniques that can be used to build integrated biological networks and discuss several categories of network-based path finding methodologies that aim at obtaining a more functional understanding of cellular behavior.

Inferring Interaction Networks from Omics Data

Understanding the cellular behavior from a systems perspective requires the identification of functional and physical interactions among diverse molecular entities in a cell (i.e., DNA/RNA, proteins and metabolites). The most straightforward manner of capturing interactions between molecular entities is by representing them as an interaction network. Here, molecular entities are represented by nodes and the interactions between them by edges. In this section, we elaborate on different types of networks that can be constructed from omics data and present supervised learning strategies to assign reliabilities to interactions.

Network Representations

Classically, a distinction is made between a functional network in which nodes usually correspond to proteins or genes, and edges represent functional relations between the nodes and a physical network where edges represent direct physical interactions (Fig. 19.1). Proteins connected in a functional network can be interpreted as being active in the same pathway or being needed together to mediate a specific function, but they do not necessarily physically interact. Examples of specific functional networks are, for instance, genetic interaction networks and coexpression networks. Within a physical network, different molecular layers can be distinguished: intracellular signal transduction, which transmits information from the surface to the nucleus, for instance by means of protein phosphorylation, and protein interactions that propagate this signal. In addition, (post-) transcriptional regulation processes comprise transcription factor (TF) proteins or sRNAs regulating the expression of genes and finally metabolic reactions catalyzed by enzymes, where metabolites are converted into energy and building blocks. Each of these different layers in a physical network can be deducted from their own specific datasets and will have their own characteristics.

Fig. 19.1
figure 1

Overview of molecular networks that can be inferred from omics data. (a) Functional networks: molecular entities represented by nodes in the network share a functional relation that does not require their physical contact. Genetic interaction network: edges reflect the phenotype that is observed when both nodes (genes) are inactivated simultaneously (double mutant). Coexpression network: nodes correspond to genes and edges represent the mutual similarity in expression profiles between connected nodes. (b) Physical interaction networks. Physical contact occurs amongst members (nodes) of the network. Signaling network. Nodes are proteins and edges represent signaling events (e.g., phosphorylation). Protein interaction network. Edges represent physical interactions between proteins represented by the nodes. (Post-)transcriptional network. Nodes represent either regulators or target genes and the directed edges reflect the physical regulator (transcription factor, sRNA)-target interactions. Metabolic network. Edges correspond to metabolic reactions catalyzed by enzymes represented by the nodes

Overview of Different Datasets

Small-scale laboratory experiments alone are impractical for creating a genome-scale network of different types of interactions, mainly for reasons of cost and time. Recently, advances in experimental methods made it possible to generate interaction datasets in a high throughput manner. Such datasets, like for instance, protein–protein interactions (PPI), have been generated for model organisms such as Saccharomyces cerevisiae [19.4,5,6,7,8], Caenorhabditis elegans [19.9], and Drosophila melanogaster [19.10,11], as well as Homo sapiens [19.12,13] by genome-wide yeast two hybrid (Y2H) screens and large-scale affinity purification/mass spectrometry. Technologies such as ChIP-chip and ChIP-seq, make it possible to measure the TF-DNA interactions at a genomic scale [19.14,15,16], and mass spectrometry (MS)-based proteomics have enabled the large-scale mapping of in vivo phosphorylation sites [19.17].

Due to the large flood of experimental interaction data becoming available, several efforts have been made to store and centralize these datasets through the construction of databases. Some databases capture data about a specific organism or research topic, like, for instance, transcriptional regulation, while others integrate data from specific organisms and/or different interaction types in a standardized manner. Gradually more and more specific databases are merged into these integrated databases. Table 19.1 gives an overview of some frequently used databases that provide physical interaction data for several eukaryotic model organisms, categorized by the type(s) of data they provide.

Table 19.1 Overview of databases containing physical interaction data for eukaryotic model organisms

The fragmentation of data over a rising amount of databases [19.33] makes an integrated and comprehensive use very difficult. To reduce this problem, Bader, Cary, and Sander [19.34] provide an extensive overview of several databases (at the time of this writing this number was equal to 328) spanning different interaction layers across a multitude of organisms in a meta-database named Pathguide.

Assessing the Confidence in Interactions Derived from High-Throughput Experimental Data

Experimental datasets generated by high-throughput methods like Y2H or ChIP-chip are not only prone to high rates of false positive interactions [19.15,35], low overlap between them also indicates that the current interaction maps are far from complete [19.36]. Assessing the quality of the data obtained is useful both for deciphering the correct molecular mechanisms underlying given biological functions, and for intelligent future experiment design [19.37].

Von Mering et al. [19.38] addressed the problem of extracting highly confident interactions between proteins from high throughput data sources by using the intersection of direct high throughput experimental results. Although they were able to achieve low false positive rates, the coverage in the number of retrieved interactions was also very low. Increasing the coverage of a network is especially useful for humans since the protein interaction map, for instance, was estimated to be only 10% complete [19.36]. Predicting interactions augments the current knowledge of the relationship between distinct cellular processes and underlying mechanisms of diseases.

Several machine-learning methodologies were suggested to assign reliabilities to interactions identified by experimental data, as well as to predict de novo interactions. In Sect. 19.2.2, we, therefore, introduce the concept of supervised learning to assess interactions.

Integration Frameworks Based on Supervised Techniques to Predict Interactions

Supervised learning methods [19.39] infer a function from a training set. Such a training set consists of pairs of input vectors and their corresponding known output. When the output is discrete, the supervised method is called a classifier. The learned function between input and output can then be used to predict the output of any valid input of which the output is not yet known. Applied here, a classifier would exploit known interactions to infer novel interactions from omics data. They can learn the set of data characteristics (features) that allow distinguishing true from false interactions from a set of known interactions (training set). A novel interaction is then predicted to be true or false, depending on the extent to which it shares similar features with the interactions in the training set.

Since classifiers are commonly used to assess the confidence in interactions from omics data, some guidelines for the choice of features and training sets are discussed in the next section, as well as two types of classifiers that are frequently used to stratify many candidate interactions by confidence or predict novel interactions, namely Bayesian approaches and logistic regression. A short case study in predicting PPI and functional interactions is presented in the last part of this section, as an enormous amount of high throughput experimental data for PPI is nowadays freely available in several databases. However, the integration process is general and can be used for assessing interactions at other network levels as well, using the standard frameworks described below, together with a set of features and training set that is specific for the dataset being assessed (e.g., [19.35,40] for assessing TF-DNA interactions obtained from ChIP-chip).

Features

The set of features provided to the classifier are measurable entities or evidences that characterize an interaction or a noninteraction. The classifier then learns which of the provided features are predictive for the interactions at hand. Such measurable entities can be direct information (e.g., the interaction was seen in an experiment) or indirect information (e.g., the expression correlation of two proteins could indicate that they are members of the same complex). Examples for predicting PPI are, for instance, network topology-based features [19.41] and GO biological process similarity [19.42,43], amongst others. Nucleosome occupancy [19.40], DNA binding motifs [19.35,40], and shared phylogenetic profiles (i.e., occurrence of the interaction in multiple species) [19.35] have been shown to be predictive for TF-DNA interactions. Coexpression between genes [19.35,40,42,43,44,45] has been used both for predicting TF-DNA and PPIs.

Training Sets

The prediction quality of a classification scheme stands or falls with the choice of a golden standard training set. This training dataset usually consists of positive and negative examples and is used to discover a predictive relationship between several features and the positive and negative examples.

An ideal golden standard should be independent of the data sources serving as features, sufficiently large for reliable statistics, and free of systematic bias [19.42]. Moreover, the choice of training set also depends on the prediction task at hand: positive and negative examples should reflect the same entities as the ones one would like to predict. This means, for instance, that a golden standard for predicting protein complexes should consist of proteins belonging or not belonging to the same complex, while positive and negative examples for predicting a functional network, on the other hand, should reflect functional and nonfunctional relationships, respectively.

A set of positive examples is usually based on a curated, literature-derived dataset, containing only high-confidence interactions. For predicting physical protein interactions, a high quality subset of the Database of Interaction Proteins (i.e., DIP [19.23]) discovered by small-scale experiments or data from individually performed experiments listed in The Munich Information Center for Protein Sequences (i.e., MIPS [19.25]) can, for instance, be applied. TF–DNA interactions from the Incyte YPD database [19.30] could serve as a positive set for transcriptional interactions in yeast [19.35,40]. Positive examples of functional relations between proteins can be extracted from the gene ontology (i.e., GO [19.46]) database annotations. Usually, proteins are considered functionally related if they share a specific biological process GO term (e.g., contain less than 200 annotations [19.47]).

A good set of negative examples is harder to define, since noninteracting pairs cannot be observed. Negative training sets can, for instance, consist of randomly combined pairs [19.35], 44], randomly observed interactions in a high throughput dataset [19.45,48], or, in the case of protein interactions, proteins occurring in different subcellular components [19.42,43,44], and proteins not sharing any specific GO term [19.47,49,50,51,52,53].

Bayesian Approaches

Different sources of evidence can be probabilistically combined to predict interactions using Bayesian formalism. This learning framework allows for combining highly dissimilar types of data in a model that is easy to interpret and that can readily accommodate missing data.

The posterior odds of interaction between two molecular entities (O post) represent the probability that an interaction occurs given the presence of several genomic features, divided by the probability that such an interaction will not occur given the presence of these features. This can be formalized using Bayesʼ theorem,

O post = P ( I | f 1 , , f N ) P ( ~ I | f 1 , , f N ) ,
(19.1)
= P ( f 1 , , f N | I ) · P ( I ) P ( f 1 , , f N ) P ( f 1 , , f N | ~ I ) · P ( ~ I ) P ( f 1 , , f N ) ,
(19.2)
= P ( I ) P ( ~ I ) · P ( f 1 , , f N | I ) P ( f 1 , , f N | ~ I ) ,
(19.3)
= O prior · LR .
(19.4)

The posterior odds of an interaction can thus be calculated as the product of the prior odds (O prior) of interaction and the likelihood ratio (LR) of an interaction (19.4).

The prior odds of interaction are defined as the probability of encountering an interaction among all pairs, divided by the probability of observing no interaction between a pair. The likelihood ratio represents the probability of observing the values in the predictive datasets given that a pair of molecular entities interacts, divided by the probability of observing these values given that the pair does not interact.

A naive Bayesian classifier makes the assumption that the genomic features (denoted by f 1,… , f N ) are independent. In this case, the LR can be calculated as the product of the individual likelihood ratios from the respective genomic features (19.5)

LR = Π i = 1 N [ P ( f i | I ) P ( f i | ~ I ) ] .
(19.5)

The likelihood ratio for every genomic feature can be estimated by counting the frequency of occurrence of interacting and noninteracting pairs in the golden standard that possess a particular value of the feature.

In the case of features with correlated evidence, the likelihood ratio cannot be factorized in this way and all possible combinations of all states of the features must be considered, which can be computationally intensive. The prior odds are more difficult to assess, since not all true interactions are known. For PPI this parameter, for instance, be estimated by examining the average number of interactions per protein for which all known interactions have been identified in the literature [19.43,44].

After deriving likelihood ratios for independent features from the golden standard, the likelihood ratio for every protein pair can be determined by combining the likelihood ratios for every independent evidence source [19.43,44]. An interacting pair is then predicted as positive if its likelihood ratio exceeds a certain cut off [19.42,49,51].

Logistic Regression

A logistic regression is a generalized linear model that is used to calculate the probability of the outcome of an event, e.g., the probability of observing an interaction between two proteins. The relationship between the response variable (e.g., observing an interaction (= 1) or not (= 0)) and the predictor variables (i.e., genomic features and/or experimental observations) is given by a logistic function

P ( I ) = 1 1 + e - ( β 0 + β 1 f 1 + + β N f N ) .
(19.6)

The logistic function can take as an input (i.e., evidence features f 1,… , f N ) any value from negative infinity to positive infinity, whereas the output (i.e., probability of an interaction P(I)) is confined to values between 0 and 1. Parameters β 1,… , β N can be estimated by using a set of positive and negative interaction pairs as output (yielding output values of 1 and 0, respectively) and their corresponding features as input in a maximum likelihood approach [19.39]. The estimated parameters can then be used together with evidence features corresponding to an interaction between two molecular entities to predict the probability that these entities truly interact.

Case Study: Inferring Protein Interaction Networks from Omics Data

Examples of PPI and Functional Networks

Using the Bayesian framework or slight variations of it and specific sets of genomic features or experimental datasets, both functional networks for S. cerevisiae [19.49,50], C. elegans [19.54], human [19.53], Arabidopsis thaliana [19.52], mouse [19.47] and protein-complex [19.42], and PPI networks [19.43,44] for human were developed.

Methods based on logistic regression have been used to assess the confidence of interactions observed in experimental data [19.41,45,48] by integrating experimental information, topological measures and/or expression correlation, and have been used in several path finding approaches to assign a confidence score to the PPI in a yeast network [19.55,56,57,58] or a human network [19.59], and to assess the reliability of experimentally determined protein interactions in D. melanogaster [19.11].

Performance

Many supervised classification methods have been developed to integrate direct and indirect information on protein interactions. They each differ in the collection of integrated data sources, approach, and implementation. Qi et al. [19.60] independently investigated the performance of different classifiers and the importance of different biological datasets, together with various golden standards. They concluded that a classifier based on Random Forests performed best among the classifiers, followed by a logistic regression. However, Suthram et al. [19.61] assessed the performance of six approaches, each with their own combination of features and classification method, and showed that a rather complex approach based on Random Forests [19.62] had lower overall performance compared to other methods tested. Both authors could conclude that including many input variables does not necessarily result in a better prediction performance, and in some cases even the opposite can be true. However, utilizing any probability scheme turned out to be better than considering all interactions observed to be true or equally probable.

Using Interaction Networks to Interpret Functional Data

Nowadays with large amounts of interaction data being generated, genome-wide networks can be created for an increasing number of organisms. These networks can be exploited to study a molecular entity like a protein in a wider context than just in isolation. However, the inferred physical networks are static and do not reveal which parts of the networks are active under certain conditions and how perturbations are propagated through the network. Integrating physical interaction networks with functional data, like gene expression data makes it possible to reveal relevant active paths or substructures in the network.

High-throughput techniques now allow genome-wide views of the molecular changes that occur in cells as they respond to stimuli. However, data derived from these high-throughput techniques unveil that our understanding of cellular systems is still fragmentary, even of well-characterized model systems. In humans, for instance, only about 30–40% of all differentially expressed genes for transcription factors NF-kB and STAT1 appear to be direct targets [19.63]. Yeger–Lotem et al. [19.64] observed that the results of genetic screenings (i.e., identifying genetic hits, or genes whose individual manipulation alters the phenotype of stimulated cells) and mRNA profiling (i.e., identifying differentially expressed genes following stimuli) often hardly overlap and provide a limited and biased view of cellular responses.

Exploiting the network structure can help in gaining a comprehensive picture of the functioning of a cell, by providing a mechanistic explanation that links the observed effects to the perturbation or cause exerted. Such an underlying mechanism is unlikely to be discoverable when looking at all datasets separately. Several approaches exist for mining the information embedded in integrated networks, dependent on the specifics of the problem statement. Network clustering strategies for instance, which search for highly connected subnetworks, have been successfully used to distinguish cancer-causing mutations from neutral mutations [19.65] or to assess the structure of the yeast genetic interaction network, revealing insights in gene function and modular organization [19.66].

In this chapter we focus on different approaches that have recently been developed to reveal the functional state of a network or to find an explanation for the observations in functional data through paths in the network. A simple path in a network is illustrated in Fig. 19.2b and is defined as a collection of edges that connect a source node (i.e., gene causing an effect or input gene of interest) and a target node (i.e., affected gene or output gene of interest) in an interaction network, such that each selected edge is connected to one other selected edge and the information spread by the source node can reach the target node without interruption. A path can be a collection of simple paths, containing several branches connecting a source with a target node (Fig. 19.2c). There is no further constraint that the nodes within a path should be densely connected to each other, which would refer to a cluster in a graph and would comprise a different problem statement.

Fig. 19.2
figure 2

Definition of a path (a) example of an interaction network, with an input node (red node), output node (blue node), and other interacting entities (green nodes) mapped on the network. Arrows between nodes represent directed interactions, lines between nodes represent undirected interactions (b) a simple path from input to output node is highlighted in the network, dashed lines, arrows, and transparent nodes do not belong to the selected path (c) a branched path from input to output node, containing multiple simple paths is highlighted in the network. Dashed lines, arrows, and transparent nodes do not belong to the path

In this second part of the chapter, an extensive overview of path finding methodologies, illustrated with several applications, is given. Different approaches are categorized according to the underlying goal they try to accomplish. These goals are represented in an abstract way in Figs. 19.319.6, and are further clarified at the beginning of each category.

Fig. 19.3
figure 3

Connecting one or several causes to their effect(s) by unveiling the underlying active paths. (a) Example of an interaction network on which a known causal gene (red node or input) and its affected gene (blue node or output) were mapped. The underlying path responsible for transferring the information from input to output is unknown (blue dashed line). (b) The underlying mechanism that explains the observed effect is highlighted in the network, dashed lines, arrows, and transparent nodes do not belong to the selected path. Arrows between nodes represent directed interactions, lines between nodes represent undirected interactions

Fig. 19.4
figure 4

Integrating cause–effect pairs to confidently infer edge attributes on the network (a) example of an interaction network on which different known cause (red nodes)–effect (blue nodes) pairs were mapped. The type of observed effect is also taken into account: a regular arrow represents an activating effect from input to output; a cut arrow represents an inhibiting effect from input to output. The underlying path responsible for transferring the information from input to output is unknown (blue dashed line). Also the type of effect (i.e., activating or inhibiting) for each edge on the path must be inferred consistently by making use of the cause–effect pairs (b) the underlying path that explains the observed effect is highlighted in the network, whereby also a type of effect to each edge in the path is assigned. Dashed lines, arrows, and transparent nodes do not belong to the selected path, the blue dashed lines show the observed cumulative effect from input to output as in (a). Arrows between nodes represent directed interactions, lines between nodes represent undirected interactions

Fig. 19.5
figure 5

Identifying (an) unknown causal input(s) for (an) observed effect(s). (a) Example of an interaction network on which several candidate causal genes (red nodes or possible inputs) and an affected gene (blue node or output) were mapped. The underlying path responsible for transferring the information from input to output is unknown (blue dashed line). Also the most likely input for the observed output should be identified (red question marks). (b) the most likely causal gene (red node) together with the underlying mechanism that explains the observed effect (blue node) is highlighted in the network, dashed lines and arrows and transparent nodes do not belong to the selected path. Arrows between nodes represent directed interactions, lines between nodes represent undirected interactions

Fig. 19.6
figure 6

Identifying network entities related to network entities of interest. (a) Example of an interaction network on which a known disease-related gene (red node and full blue line to the disease) and candidate disease-related genes (orange nodes and dashed blue lines to the disease) were mapped. To infer the most likely candidate disease-gene(s), their relatedness to the known disease-related gene (dashed orange lines) is examined through interactions on the network. (b) The most likely candidate disease-related gene and its relation to the known disease-related gene is highlighted on the network. Dashed lines and arrows and transparent nodes do not belong to the selected paths

Connecting One or Several Causes to Their Effect(s) by Unveiling the Underlying Active Paths

The common objective of methods described in this paragraph is to reveal the underlying pathways transmitting a signal from one or several causes to their corresponding observed effect(s) by adopting a network-based approach (Fig. 19.3). The cause could, for instance, be a membrane protein and the observed effect a DNA binding protein that receives the signal, but the intermediate molecular interactions through which the signal was transduced from cause to effect is unknown.

Several of the methods developed for this purpose use, in addition to the given cause and effect pairs, other functional data like gene expression to extract biologically relevant paths from the network. This either by using the extent to which a network node is differentially expressed as an indication of its contribution to a plausible signaling path [19.64,67] or by using a measure of expression correlation between edge nodes [19.55,68], between edge nodes and the source and target nodes [19.69], or between edge nodes and target node [19.70] to indicate the confidence we have in an edge contributing to a causal path.

The reconstruction of signaling pathways by overlaying PPI data with cause–effect pairs has received a great deal of attention. Steffen et al. [19.71] were one of the first to model simple paths of a specified length through a physical protein interaction network, starting at a membrane protein and ending on a DNA binding protein in a procedure called NetSearch. Paths were ranked based on a statistical scoring metric, reflecting how many path members clustered together according to their expression profiles. Simple paths that had common starting points and endpoints and the highest ranks among each other were then combined into the final model of branched networks.

In reality, simple paths cannot capture the full complexity of signaling pathways since there may be multiple interaction paths within a pathway. Scott et al. [19.55], therefore, adapted the color coding technique and allowed the identification of more complicated substructures such as trees and series-parallel graphs. A number of candidate paths are firstly found with a score assigned to each candidate and the top scoring paths are then assembled into a signaling network. Lu et al. [19.69] extracted nonlinear path structures from the network and potential interactions between related paths were taken into account.

However, most of these methods generally cannot directly find a signaling network as a whole, i.e., they first identify separate paths and then heuristically assemble them into a signaling network. Other approaches like those of Zhao et al. [19.68], Yosef et al. [19.58], Yeger-Lotem et al. [19.64], and Ren et al. [19.70] infer active paths immediately as a subnetwork from the whole network. The methods have in common that they try to explain cause–effect pairs in a particular set of experiments by solving an optimization problem which typically balances the reliability of the edges used by the length and complexity of the possible paths. A third category of methods uses the frequency of occurrence of a path with a predefined form (i.e., a motif) in the network to explain cause–effect pairs on a more statistical basis. An example of this category is the method of Joshi et al. [19.72], which is discussed in more detail in the case study at the end of this section.

The majority of these methods use one or more MAP kinase signaling pathways involved in pheromone response, filamentous growth, maintenance of cell wall integrity, and high osmolarity as their benchmark, since these are among the best studied signaling networks. These pathways are activated by G protein-coupled receptors and characterized by a core cascade of MAP kinases that activate each other through sequential binding and phosphorylation reactions. A method comparison performed by Zhao et al. [19.68] demonstrated that most methods can to a large extent uncover the known signaling paths, which confirm the effectiveness and prediction power of the approaches. On the other hand, the results also show that there is no single method that can perform the best in all cases, and different models are complementary to each other.

While previous methods concentrate on the reconstruction of signaling cascades between a membrane receptor protein and a target protein, Yeger–Lotem et al. [19.64] focus on identifying molecular interaction paths connecting several related genetic hits (sources) and differentially expressed genes (targets), revealing the underlying response pathways. They hypothesize that some of the genetic hits, which are enriched for regulators of cellular response, will be connected via regulatory paths to the differentially expressed genes, which are the outputs of such paths, via components of the response that were not detected by either the genetic or the mRNA profiling assays themselves. To identify these undetected path components, the authors developed a flow algorithm called ResponseNet. Huang and Fraenkel [19.67] reconsidered this problem by taking into account that both the input data from experimental observations and the interactome can contain noise. They treat the goal of connecting data as a constraint that is attempted to be satisfied through an optimization procedure, resulting in a subnetwork that contains mainly reliable edges while excluding possibly false positive source or target nodes.

Other application examples of this problem formulation can be found in the reconstruction of metabolic pathways [19.73], connecting a source metabolite to a target metabolite, and the reconstruction of transcriptional regulation [19.74], connecting regulators to their target module consisting of coexpressed genes.

Case Study

Previously mentioned techniques do not search for general mechanisms or path structures that are common between different cause and effect pairs, nor include a significance analysis that assesses the statistical significance of the inferred paths. In this way, it is difficult to assess if a given network model truly reflects underlying regulation mechanisms or appears just by chance due to the inevitable noise inherent in the perturbation data as well as in the physical interaction networks.

Joshi et al. [19.72], therefore, propose an alternative strategy by searching for regulatory path motifs in an integrated transcriptional (TRI), protein–protein (PPI) and phosphorylation (PhI) interaction network. Regulatory path motifs are defined as paths of length up to three, which connect a causative gene (for example a transcription factor) to a set of effect genes which are differentially expressed after perturbation of the causative gene, and occur significantly more often than expected by chance in an integrated physical network. The method was tested by searching regulatory motifs between 157 deleted [19.75] and 55 overexpressed [19.76] TFs in S. cerevisiae, together with their corresponding differentially expressed genes.

The significance of the regulatory path motifs is determined by a randomization strategy: the cause and effect pairs were permuted for 10000 times, keeping the number of perturbed genes for each transcription factor constant. Next, the frequency of occurrence of each path motif in the randomized data sets was calculated. If the number of paths in the real perturbation data lies at the right tail of this random distribution (using a z-test statistic), the path was considered significant.

Out of all possible paths of length up to three, the algorithm identified eight regulatory path motifs, of which five where enriched in both deletion and overexpression data. These eight motifs explain 13% of all genes differentially expressed in deletion data and 24% in overexpression data, a more than five to tenfold increase compared to using directional transcriptional links only, confirming that perturbational microarray experiments contain mostly indirect regulatory links.

Like static network motifs [19.2,77,78], regulatory path motifs were found to aggregate into modular structures where the differentially expressed targets of a transcription factor reached by the same path through the same intermediate nodes, form a module. Many path modules showed a high coexpression and were overrepresented in a particular functional category, validating the biological relevance of the regulatory path motifs.

This approach is not only more likely to reflect reflect general regulatory strategies used in biological networks, but also the specificity of a TF to a particular regulatory path can hint towards its mode of action. It was found for instance, that 75% of the genes being perturbed after MET4 overexpression, can be explained by TRI, PPI-TRI, and PPI-TRI-TRI motifs, indicating that MET4 acts together with different combinations of auxiliary factors. In addition they observed that many network motifs that were significantly enriched in response to DNA damage in yeast were shorter than those enriched during cell cycle, exemplifying that environmental responses prefer fast signal propagation while developmental processes progress through multiple stages of interconnecting TFs [19.72,79]. Thus regulatory path motifs can be used to characterize the condition-dependency of the response mechanisms across multiple integrated networks.

Integrating Cause–Effect Pairs to Confidently Infer Edge Attributes on the Network

While previous approaches limit themselves to inferring common underlying paths connecting related cause–effect pairs, others integrate several (even unrelated) known cause–effect pairs in order to assign specific attributes to the interaction graph (Fig. 19.4). Such attributes are, for instance:

  1. 1.

    The presence or absence of an edge in the path connecting cause and effect

  2. 2.

    The regulatory effect of a node, i.e., activating or repressing

  3. 3.

    The direction of information flow through an edge.

By integrating several cause–effect pairs more assignments of attributes can be made than when considering each pair in isolation. This because the assignment of attributes explaining a particular cause–effect pair should also be able to explain causes and observed effects that occur downstream of this pair in the network. Or in other words, the objective is trying to explain as many cause–effect pairs as possible such that the biological constraints on the network are consistent.

The inferred models of Yeang et al. [19.80], called physical network models, are annotated molecular interaction graphs. In this framework the presence of a certain edge in the physical network, the directionality of signal transduction in PPIs, and the regulatory effect of the interaction are determined if their combination is able to explain observed differential expression upon single gene knockouts. These strategies where further explored to investigate the mechanisms of the coupling between regulatory and metabolic networks [19.81]. The MTO algorithm of Medvedovsky et al. [19.82] and the method of Gitter et al. [19.83] limit themselves to determining a single direction for each edge, so that a maximum number of pairs have a directed path from the cause to the effect.

Case Study

SPINE [19.57] improved the physical network models [19.80] by assigning an activation/repression attribute with each protein so as to explain (in expectation) a maximum number of knockout effects observed. They do not explicitly model the direction of the edges, but most PPIs appeared in one direction only in the inferred consistent pathways. The goal of the algorithm is to infer regulatory pathways in the network that provide a consistent explanation for the input set of knockout pairs.

A path is a consistent explanatory path (Fig. 19.4b) if:

  1. 1.

    The aggregate sign of the path is equal to the observed expression direction (upregulated or activated versus downregulated or inhibited)

  2. 2.

    If every subpath connecting another knockout pair is also consistent.

The optimization problem is defined as that of finding an assignment that will maximize the expected number of pairs that have at least one consistent path, given by

E ( ( s , t ) X K s , t ) = ( s , t ) X E ( K s , t ) = ( s , t ) X p ( K s , t = 1 ) ,
(19.7)

where K s,t is a variable that indicates if there exists at least one regulatory path consistent with a knockout pair (s, t) out of a collection of knockout pairs X; p(K s,t  = 1) corresponds to the probability that at least one consistent path exists for knockout pair (s, t). The optimization problem is reformulated and solved as an integer linear program.

The authors evaluated their method by applying it on a genome-wide integrated yeast network consisting of PPI and TF-DNA data, in order to explain the effects observed in gene expression under different single-gene knockouts [19.84]. Here, a significant overlap between the modelʼs prediction and the known signs was seen. Moreover, increasing the path length from one edge (i.e., only a direct TF-DNA link) up to three edges (i.e., one TF-DNA link and two other PPI/TF-DNA links) in different runs clearly showed the importance of looking at paths rather than considering direct edges only, since the amount of explained knockout pairs increased accordingly.

Identifying (an) Unknown Causal Input(s) for (an) Observed Effect(s)

In this section, we discuss applications where several effects are observed, but the true cause of these effects is unknown (Fig. 19.5). The common objective here is thus to infer this unknown cause or causes by use of a network. Examples of such problems can be found in the domain of expression quantitative trait loci (eQTL) mapping. With the availability of complete genomes of single strains, identifying which alterations in the DNA sequence (i.e., causes) are responsible for observed changes in gene expression (i.e., the quantitative trait, or effects) becomes increasingly important. Usually, in a first step the association between a geneʼs expression level (i.e., eQTL mapping) and each genomic region (i.e., the expression quantitative locus or eQTL) is examined by a statistical method [19.85]. In the case of multifactorial traits, multiple loci can be associated with the geneʼs expression behavior, which complicates eQTL analysis. In addition, due to linkage disequilibrium each of the associated loci can contain several genes, which limits the localization of the true causal gene. Even when the causal gene can be identified, the molecular mechanism through which the association is exerted often remains elusive [19.86].

Path finding methods can here be used to identify the causal gene within an associated genomic locus and the underlying pathways that transmit signals from the locus to the affected target. The genes that altered their expression levels are considered as effects, while all the genes in the associated loci are defined as possible causes.

Tu et al. [19.87] proposed a random walk approach to infer the causal gene in a locus and the underlying pathways from a physical interaction network consisting of protein phosphorylation, PPI, and TF-DNA interactions. They assumed that the pathway starts with one causal gene in an associated locus and ends at the transcription factors regulating the target gene such that the expression of the genes on the pathway are correlated with the target gene. For each affected gene and each of its associated eQTLs their stochastic algorithm is performed separately to identify the causal gene. During the walks initiated on the network, different genes will be visited with different frequencies depending on their expression profile. The genes with higher frequencies are then assumed to be more likely to be the causal gene, and the most frequently traveled paths are regarded as the underlying regulatory pathways. For 239 out of 585 eQTLs identified in a study of 112 yeast segregants of Brem et al. [19.88] a causal gene could be significantly predicted. The authors highlighted GPA1 as causal regulator for target gene PRP39, a result that was experimentally verified by Yvert et al. [19.89].

Suthram et al. [19.86] further adapted the method of Tu by considering the analogy between random walks and electric circuits in a new method, named eQTL electrical diagrams (eQED). eQED models the flow of information from a locus to a target gene as electric currents through the protein network. The authors consider all loci influencing the target simultaneously, allowing multiple loci to reinforce each other when they fall along a common regulatory pathway. The causal gene in each locus is then predicted as the one with the highest current running through it. By validating the eQED model on the eQTL data set of Brem and Kruglyak against a golden standard of knockout expression profiles, the multilocus model indeed showed a highly improved accuracy compared to the single-locus model and the method of Tu et al. (80% versus 72% and 50%, respectively).

Case Study

Inspired by the eQED electric circuit model, Kim et al. [19.90] developed a method for the identification of candidate causal genes and dysregulated pathways that are potentially responsible for the altered expression of target genes associated to glioblastoma multiforme (GBM), the most common and most aggressive malignant primary brain tumor in humans. Applied on gene expression and genomic alteration (in this case copy number variations or CNVs) profiles of 158 GBM patients, their methodology comprises four steps:

  1. 1.

    In a first step a set of genes is selected that show differential expression in the patients while taking into account disease heterogeneities among different patients, thus extracting sets of differentially expressed genes that are specific to a subgroup of patients. These differentially expressed genes are hereafter called target genes.

  2. 2.

    Subsequently, an eQTL mapping is performed by a linear regression analysis to determine the association between the expression of each target gene and copy number alterations of tag loci. A liberal p-value was chosen to retain most of potentially interesting relationships.

  3. 3.

    Then, to filter out false positives and to determine the most likely causal genes within each region of associated CNVs for each target gene, a physical network-based approach based on the electric circuit diagrams of Suthram et al. [19.86] was applied. Each node in the circuit represents a gene and holds a certain voltage to be determined. Each edge represents an interaction between node entities and has a conductance (i.e., how easily electricity flows along a path) defined by the mean expression correlation of its nodes with the target gene. As such, the authors ensured that a single noncorrelated node reduced but not completely interrupted the current flow, while a cluster of noncorrelated nodes put a considerable resistance to the current flow. Using Ohmʼs law and Kirchhoffʼs current law, the amount of current through a node was calculated. Candidate causal genes for each target gene where then selected based on a permutation test to estimate the statistical significance of the current flow through the nodes.

  4. 4.

    Finally, this resulting set of causal genes was further reduced by imposing another filter: a minimum set of causal genes was selected that could explain all disease cases except for a few outliers.

Assessing the significance of the identified set of causal genes by determining overlap with sets of known GBM/glioma specific genes showed that their approach could uncover more cancer relevant genes than a simple association approach and demonstrated the increased predictive power of the model.

The authors also assessed the importance of genes in the paths from putative causal genes to their target genes and observed the emergence of hubs, genes that appeared in a disproportionally large number of paths. Such a set of hubs contained important transcription factors such as MYC and E2F1, and oncogenes such as JUN and RELA, and was enriched in genes that appeared in cancer pathways, the cell cycle, and several important signaling pathways. While such hub genes were clearly related to cancer, they would hardly have been identified by analyzing differentially expressed genes alone, demonstrating the advantages of a pathway-based approach.

Moreover, a GO biological process enrichment analysis of the uncovered subnetworks revealed frequently re-occurring classical cancer related pathways like insulin receptor signaling pathways, RAS signaling, as well as a glioma-associated regulation of transforming growth factor-b2 production and SMAD pathway. Such pathways can then be considered as GO biological process hubs or highways, connecting many different causal genes with their targets. Such an observation supports the hypothesis of a pathway-centric view of complex disease, namely that many different genomic alterations potentially dysregulate the same pathways in complex diseases.

Among the discovered set of putative causal genes and pathways, an influence of PTEN and CDC2 was observed on the expression of WEE1 through transcription factors TP53 and E2F4. This tyrosine kinase in turn phosphorylates the protein product of CDC2 (i.e., CDK1), a signaling event that is crucial for the cyclin-dependent passage of various cell cycle checkpoints and suggested as an important feedback mechanism for cancer by the authors.

Identifying Network Entities Related to Network Entities of Interest

In this fourth and last category, the objective is to identify entities related to a set of entities of interest (hereafter named seeds), by exploring the paths (e.g., long versus short paths, paths through highly connected nodes versus through very specific nodes, one simple path versus multiple paths connecting cause and effect, ...) that connect them in the network, rather than inferring the underlying paths that transfer the signal from a cause to affected genes like in the previously described approaches (Fig. 19.6). This approach is useful when one is looking for the cause of an observed effect, but they cannot both be mapped on a physical or functional molecular network. A domain where this strategy, for instance, has been proven useful is when dealing with diseases as observed effects.

For most diseases only a limited number of causal genes is currently known [19.91]. Genome-wide association studies, whereby genomic variation are associated to a certain phenotype, typically result in one or more linked chromosomal regions, which in turn can contain several genes. Since the elucidation of disease mechanisms can improve diagnose or medical care, several approaches have been developed to identify novel disease genes.

Motivated by the observation that genes causing a specific or similar disease phenotype tend to lie close to each other in a protein–protein network [19.92], several network-based approaches have been developed. These methods have a common approach in the sense that they try to score candidate disease genes based on the assumption that good candidates reside in the neighborhood of certain a priori determined genes. These a priori determined genes are, for instance, genes known to be involved in the disease [19.93,94,95] or in related phenotypes [19.59,96,97], or differentially expressed genes upon the phenotype [19.98]. These a priori determined disease related genes are called seed genes in what follows.

Several types of measures can be applied to score candidate genes. A first, intuitive way of identifying disease related genes is based on direct neighborhood: candidate genes that are directly connected to one or more seed genes are then predicted to be potentially causative [19.93]. However, it is possible that two disease related genes do not interact with each other directly, but are, for instance, part of the same pathway, and disruption in either one of them leads to the same disease. These cases will be missed by direct neighborhood counting. To account for indirect interactions, one can use the shortest path length between a seed node and a candidate gene as a measure of its relatedness to the seed gene: if a certain candidate node lies at most k edges away from the seed node, it is considered as a disease gene. George et al. [19.94] observed that as the shortest path length between a seed node and a candidate node increases, the sensitivity of identified disease genes improves, but the number of false positives increases exponentially and reduces the specificity.

Protein–protein networks, however, possess the small world property, meaning that the average path length between any two nodes in the network is rather short [19.99]. One consequence of this for methods relying on direct interactions or shortest paths is that it is not very unlikely to observe genes interacting with the disease seed genes but that are unrelated to the disease as such [19.95]. Moreover, methods based on shortest path length ignore the fact that there might be multiple shortest paths or also other paths with longer lengths, which could point to a higher relatedness to the seed gene than when only one path is present between seed and candidate node.

To overcome these limitations, several methods have been developed that consider the topology of the entire network (i.e., global distance measure versus local) to define a distance measure between two nodes. These methods generally propose a strategy based on random walks on graphs (i.e., the interaction network). A random walk of a certain length k on a graph represents a stochastic process starting at a seed node and each subsequently visited node is chosen uniformly at random from the neighbors of each previous node. The steady state probability p x,y,k (G) is then the probability that a random walk of length k, starting at node x would end in node y.

Kohler et al. [19.95] proposed a variant of the random walk, namely the random walk with restart, to identify disease genes in a human PPI network. Here, the walk is allowed to restart in every time step at a known disease gene (i.e., seed gene) with a certain probability. All candidate disease genes are then ranked according to the probability of the random walker reaching the candidate gene from a (set of) seed node(s), which reflects their global relatedness to the known disease seed genes. Applied on 110 disease gene families, together with their known associated genes in a leave-one-out cross-validation setting, they clearly outperformed local measures as direct neighborhood or shortest path, or approaches not based on any network.

Vanunu et al. [19.59] extended this approach by using causal genes both from the disease of interest or similar ones as seed genes in the random walk with restart. This approach can be very useful when no causal genes of the disease are known. Diseases, however, can be very heterogeneous, meaning that they can result in different phenotypes and encompass various subtypes. Exploiting all known disease genes that have been related to a heterogeneous disease might not have a sufficiently high resolution to predict novel genes for a specific subtype of the disease. To overcome this, two other methods proposed to identify gene-phenotype relationships rather than finding the gene-disease relationship directly. This strategy decomposes a disease in phenotypes and tries to identify novel phenotype-related genes by using genes related to one or more phenotypes of interest and related ones as seed genes. Li and Patra [19.96] build upon the random walk with restart strategy, but perform it on a heterogeneous network connecting the gene network (i.e., consisting of PPI) with the phenotype network (i.e., a k nearest neighbor graph presentation of phenotypes similarity) through gene-phenotype relations. Yang et al. [19.97], on the other hand, make use of the associations between protein complexes and the phenotypes of interest to perform a random walk with restart on a protein complex network.

Case Study

The aforementioned methods rank candidate genes based on their connections to known disease genes or to known causal genes of related phenotypes in a protein–protein network. However, these methods are usually ineffective when little is known about the molecular basis of the phenotype (e.g., no confirmed causal genes), or when the observed phenotypes are very specific. To this end, Nitsch et al. [19.98] developed a computational method to prioritize candidate disease genes for which limited or no prior knowledge is available, by using experimental data on differential gene expression between healthy and affected individuals for a phenotype of interest.

Genes for which significant differential expression was measured in an affected tissue compared to wild type are usually considered as promising candidates being involved in the disease. However, not necessarily the expression levels of the disease gene are affected, but rather expression of genes downstream of this causal gene. Therefore, by mapping differential expression levels on a gene network, one expects to observe a disrupted expression module around the disease gene. Other candidate genes that are not causally related to the phenotype should not be part of such a module. For this reason, the relevance of a candidate gene is scored by considering the level of differential expression in its neighborhood in a protein network instead of only taking its own expression level into account, under the assumption that strong candidates tend to be surrounded by differentially expressed neighbors.

In this work, a functional network was created using human protein associations obtained from the STRING database [19.26], since protein interaction networks are still far from complete and according to the authors might give suboptimal results due to many missing components and pathways. For each gene, differential expression values are determined from microarray experiments that measure wild type versus diseased cell lines, for a phenotype of interest. The prioritization can be performed on a list of candidate genes on a chromosomal region of interest (e.g., determined from a linkage study) or genome-wide when no list of candidate genes are available (although this will probably result in many more false positives).

The neighborhood of a candidate gene is then determined by using a graph kernel, namely the Laplacian exponential diffusion kernel. This gives a weight to each gene in the network, which decreases as a function of the distance from the causal gene, taking into account that there might be multiple paths between the causal gene and each gene in the network. It can be seen as a random walk, starting from a node and transitioning to a neighboring node with a certain probability.

Finally, each candidate gene is scored by summing up the levels of differential expression (measures by absolute fold changes) of each gene, weighted by its network distance from the candidate. Higher differential expression of neighboring genes will, therefore, result in higher scores. The significance of a candidate gene is determined by randomly distributing the differential expression data on the network and computing an empirical p-value from the random distribution of scores.

Besides benchmarking, the methodology on several monogenic diseases for which the causal gene is known, the authors also applied their method on the polygenic disorder Stein–Leventhal [19.100] for which currently no disease gene is known. They highly ranked two genes on two different chromosomal regions that were previously assigned a possible role in this disorder, namely fibrilin 3 (FBN3) and follistatin (FST). Another gene, DEAD box4, was found to be the best scoring gene and was suggested as a new candidate gene potentially involved in this disease. Although little is known about the molecular function of DEAD box 4 in mammals, the authors found several indications in literature that indicate a plausible role in the Stein–Leventhal syndrome, for instance because of its association with stem cell recruitment to the ovaries, interaction with the mRNA processing machinery, and impact on apoptosis.

Although the expression levels of genes can be determined by multiple genes together in the case of a polygenic order, making it difficult to determine the true causes of the effects, the approach has nevertheless shown to provide plausible candidates when little knowledge is available for the disease at hand.

Conclusion

Despite the emerging amount of data and the integration strategies presented at the beginning of this chapter, eukaryotic molecular networks are still low in coverage due to their size and complexity. We expect that the increasing number of available data sets will continue to expand current networks and further refine our knowledge on the included interactions, ultimately resulting in a better understanding of the cellʼs behavior. In this chapter we have presented several path finding methodologies to interrogate networks with functional data and showed how they can be used to predict disease genes, unveil hidden signaling paths, ... The use of interaction networks for unveiling mode of actions will only gain in importance with functional omics datasets profiling human diseases growing steadily in number and size.