Abstract
Cells perceive and respond to their microenvironment as a part of their functioning via networks of processes resulting from molecular interactions. The complexity of such networks has been the subject of studies that address their various aspects. Some of these include static methods that focus on graph representations and their consequent properties, while others take a dynamical systems approach based on simulations. Here, we address the problem of identifying dominant pathways in biological networks that are represented as activation and repression edges. For this purpose, we propose a hybrid method that combines static graph properties with a dynamic quantification of information flow that results from stochastic simulations. We first illustrate our method on a simple example, and then apply it to the Escherichia coli transcription network consisting of 4639 regulatory edges.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
The complex biological processes dedicated to sustaining life are commonly represented by various kinds of networks for formal analysis. These representations cover a broad spectrum from detailed rule-based models and chemical reaction networks to other stoichiometric representations as well as graph models with varying details. These different methods have their individual strengths in delivering new insights on a rich landscape of biological queries. However, more than often, the availability of empirical data, or the lack thereof, poses a bottleneck in the formal setting within the context of specific studies.
Despite the inherent complexity, experimental findings indicate that certain aspects and patterns are common in many biological networks. Some of these features resemble those in the networks that are observed outside the biochemical realm such as communication networks. By relying on these similarities, here we work with the consideration of networks of biochemical entities as processes that relay the incoming stimulus to response components. We aim at benefitting from this notion especially when larger biological processes are considered and kinetic data is scarce or difficult to apply. In particular, we address the problem of identifying dominant pathways: within an organism, biological processes work together to produce an overall global flux distribution by making the flow of information accessible over different pathways [10, 12, 16, 20]. Despite the presence of many pathways, some of the available pathways are more dominant in terms of their information flow capacity.
In the light of the observations above, our method is based on the following assumptions on biological systems. Firstly, information propagates by a series of coupled biochemical reactions, whereby cellular signals are relayed predominantly through functional modules of highly connected nodes, see, e.g., [13]. In particular, in the extreme case of scale-free networks, whose degree distribution follows a power law [1], only a tiny fragment of hub nodes process a large fragment of the information. Secondly, local signal transduction tends to be stochastic, hence information propagation by individual components is subject to noise. And finally, in the presence of multiple pathways for the signal, information flow has a predilection for the pathway of least biochemical resistance.
Our analysis of biological systems combines methods from static graph theoretical considerations in the literature with those for dynamical systems based on simulations. We work with biological systems that can be represented as directed graphs with two kinds of edges, namely activation and repression. Many biological systems can be represented in this form as well as gene transcription networks that easily fit into this category. We map each edge of the network to a reaction of a chemical reaction network (CRN). The idea here is that each activation edge consumes the instance of the incoming signal at its source node to propagate the information flow by producing an instance of the target node. The signal can then be passed on to the next reaction. Each repression edge, on the other hand, consumes the instance of the incoming signal together with an instance of its target node if it is available. This way, it inhibits the further propagation of the signal from its target node.
In accordance with the considerations above, we resort to the idea that cellular signals are transmitted dominantly through pathways of highly connected components. We implement our method by computing the reaction rates as proxies of connectivity of the source and target nodes of the corresponding edge in the network. We instantiate the rates by borrowing three topological measures from the literature that are used to study the static properties of graphs. These are the topological overlap measure (TOM) [18], the Randić index (RI) [17], and the combined linkage index (CLI) [15]. For each network, we produce three different CRNs using these measures to instantiate the reaction rates, together with a control CRN that assigns 1.0 to all the rates.
The dynamic component of our method is realised by running stochastic simulations on the CRNs by using the Gillespie algorithm that implements mass action kinetics [5]. In previous work, we have developed a conservative extension of this algorithm that traces species fluxes during simulation [8, 9]. We compare the results with TOM, RI, CLI models and the control model by quantifying the distance between their results. We quantify the flow of information in different models in terms of their simulation fluxes on the mean of repeated simulations. We then apply exhaustive breadth-first search on the resulting flux graphs to enumerate the pathways and rank them.
In the following, we illustrate our method on a simple network. We then apply it to the E. coli transcription network [4] with 4639 regulatory edges.Footnote 1
2 Information Flow in Biological Networks
2.1 Network Implementation
We work with networks consisting of directed graphs with two kinds of edges. Formally, an information flow network \(\mathcal {G} = (\mathcal {V}, \mathcal {A}, \mathcal {I})\) is given with
-
the set \(\mathcal {V}\) of vertices representing the biochemical molecules or events;
-
the set \(\mathcal {A}\) of directed activation edges where the source activates the target;
-
and the set \(\mathcal {I}\) of directed inhibition edges where the source inhibits the target.
Example 1
The network depicted in Fig. 1 provides a description of a fragment of the dopamine signalling network.
The degree of a node x, denoted with \(\textit{deg}(x)\), is the number of edges incident to the node, with loops counted twice. For two nodes, we define \(\textit{edge}(x,y)\) as the number of edges, be it activation or inhibition, from x to y. We define \(\textit{int}(x,y)\) as the number of nodes that are connected with a single edge to both x and y.
Example 2
In the network depicted in Fig. 1, deg(GBetaGamma) = 6 and deg(PQCaCh) = 2. We have that \(\textit{edge}(\textsf {GBetaGama},\textsf {PQCaCh}) = 1\). Because they do not have any common neighbours, \(\textit{int}(\textsf {GBetaGamma},\textsf {PQCaCh}) = 0\)
At the first step, our algorithm for computing the information flow maps the network to a chemical reaction network (CRN), whereby activation and inhibition edges are given with two different kinds of reactions.
The activation edges of the form \((\textsf {x}, \textsf {y})\) are mapped to reactions
which model the information flow from \(\textsf {x}\) to \(\textsf {y}\), and r is the rate of the reaction. By relying on the notion that cellular signals are transmitted dominantly through pathways of highly connected components, we compute the reaction rates as proxies of connectivity of the source and target nodes of the corresponding edge in the network. For this, we employ three different measures from the literature.
-
1.
The topological overlap measure (TOM) [18], which was originally introduced to study the relationship between the network structure and the functional organisation of cellular metabolisms. We obtain the TOM rate value r as:
$$ \begin{array}{l} r = \frac{{\textit{int(x,y)}} + 1}{{\textit{min(deg(x),} \,\,\textit{deg(y))}}} \end{array} $$ -
2.
The Randić index (RI) [17] has been related to physical and chemical properties of organic molecules. We apply it to a single edge as follows:
$$ \begin{array}{l} r = \frac{1}{\sqrt{{\textit{deg(x)}} . {\textit{deg(y)}}}} \end{array} $$ -
3.
The combined linkage index (CLI) [15] extends RI with the aim of emphasising the strongest links of each node.
The inhibitory edges of the form \((\textsf {x}, \textsf {y})\) are mapped to reactions of the form
Such a reaction models the consumption of the information at \(\textsf {x}\) to inhibit the further downstream flow from \(\textsf {y}\) to other components of the network. The reaction rate is given by \(r' = r . p\), whereby r is defined as above and the constant p is the inhibition constant, which factors for these second order reactions that can have a much higher propensity in comparison to first-order activation reactions. In our analysis, we first use a default value of 1.0 for p, and also evaluate the effect of smaller values.
Example 3
By applying the definitions of TOM, RI and CLI to the network in Fig. 1, we obtain the CRN with the reactions listed in Fig. 2 and together with the rate values listed in Table 1.
2.2 Network Simulation
CRNs can be simulated stochastically by using Gillespie’s direct method, which is also known as the stochastic simulation algorithm (SSA) [5]. Various extensions of SSA in the literature address a variety of concerns such as increasing efficiency of simulations, simulation of rare events or others, e.g., [3, 6, 11]. In previous work [8, 9], we have presented a method that extends SSA for stochastic flux analysis of CRNs. The method, called fSSA, is a conservative extension of SSA that monitors the distribution of the network resources during simulation with respect to the causal interdependence of the reaction instances. This consideration originates from non-interleaving models of concurrent computations used in computer science [7, 14]. In such a setting, the dependencies are observed in a manner that takes into account the propensity of each reaction in terms of the resources available to that reaction. As a result of this, simulations resulting from our algorithm provide a quantitative view of the flow of information in the network besides the usual time series information. The flux graphs, that are output by the algorithm, reflect what fragment of system resources flow through which pathways of the network. This kind of information becomes particularly significant when a system resource is produced or consumed by multiple components. In this regard, flux graphs display which components produce and consume such resources. For example, in the network above, GBetaGamma production and consumption can follow many different pathways in the network.
We use the fSSA algorithm to run simulations on the CRNs. During these simulations, flux graphs can be obtained for the whole simulation interval as well as for arbitrary time intervals. In contrast to similar considerations with ordinary differential equations, these time intervals can be transient intervals, whereby the system has not yet reached its steady state levels, given by the ordinary differential equation simulations. Because the flow of resources can take different pathways at different intervals of the simulation, such a capability is essential for analysing the system behaviour at different stages.
For the example network in Fig. 1, we have obtained the flux graph depicted in Fig. 2. The measures described above, that is, TOM, RI and CLI, result in different reaction rates, listed in Table 1, thus they result in different values for the fluxes. However they all result in the same topology depicted in Figs. 2 and 3. Due to stochasticity, each simulation with the same CRN produces slightly different fluxes. For a systematic comparison that takes into account these variations as well as the effect of the different measures, we have first set a control network, where all the reaction rates are set to 1.0. With the inclusion of this network, we have obtained four different networks; three given by TOM, RI, and CLI, and a control network. For each one of these four networks, we have run 10 simulations. For each flux edge in a network, we computed the mean of 10 simulations, and then normalised these mean fluxes according to the maximum flux of each network.
Our simulations resulted in the normalised flux values listed in Table 6, where we have considered a spectrum of inhibition constants. Figure 2 provides a summary of the data in Table 6 with respect to the effect of varying inhibition constant from \(p=10^{-4}\) to \(p=1.0\). We observe that the inhibition constant p does not have a significant impact in general. More interestingly, the variations in p affect the versions of CRN that are instantiated with different measures similarly. Most of the fluxes are affected to an extent of 0.04% of maximum flux, and the greatest effect is on the fluxes that feed reaction 15 or compete with these fluxes, which however do not exceed 16% even with \(p=10^{-4}\), and these greater effects are pronounced at the lower end of the spectrum. Figure 3 displays the fluxes with \(p=10^{-2}\) for TOM, RI and CLI measures.
To compare the impact of the different measures on the simulations and the resulting fluxes, we have computed the distance between the results with different networks. We define this as the sum of squared distances between normalised fluxes. That is, given that \(\mathcal {F}_1\) and \(\mathcal {F}_2\) are flux graphs as in Fig. 3, for each flux edge from a reaction x to reaction y with \(w_1\) in \(\mathcal {F}_1\) and \(w_2\) in \(\mathcal {F}_2\), we compute the sum of the values \((w_1 -w_2)^2\). If a flux edge does not exist, its weight is 0.
For this network, we observe in Table 2 that the inhibition constant does not play a significant role in distinguishing the effect of different measures, which confirms our observations above. We observe that RI measure provides the greatest distinction from the control network and CLI provides the smallest distinction. The much larger distance between CLI and RI confirms this observation. Moreover, RI and TOM appear similar. Based on these observations, Table 3 enumerates the flux pathways by ranking them according to their mean fluxes, where RI measure is used as reference. As indicated by the observations in Table 2, all measures agree on the first two rankings and RI and TOM have the same rankings, which are different from those with the control network.
3 A Case Study: Escherichia coli Transcription Network
We have applied our method to the Escherichia coli transcription network version 10.5 reported in the RegulonDB [4] with the date 13 September 2018, which is depicted in Fig. 4. In this network, the distribution of the nodes with respect to their frequency in regulations follows a power-law, whereby 1610 of the 1886 proteins participate in not more than 5 regulations. As listed in Table 7, CRP has the highest frequency as it participates in 585 regulations, followed by FNR with 322, IHF with 259 and H-NS with 195 regulations.
We applied the four measures given by TOM, RI and CLI as well as the control model with all the rates set to 1.0, and considered the inhibition constants \(p = 10^{-2}\) and \(p = 1.0\), and this way obtained 8 different versions of the model. Each of these networks consist of 4639 reactions with 1886 species. We focused our investigation on the pathways initiated by CRP as this transcription factor has the highest frequency among all the 63. We performed 10 simulations with an initial value of 10000 CRP molecules for all of the 8 cases. Each of these simulations resulted in 1000 to 1500 flux edges. For each case, we took the mean of each flux edge given by the 10 simulations. We then normalised each one of the 8 flux graphs with the maximum flux in that graph.
For a comparison, we first computed the squared distance between the 8 flux graphs. To emphasise the effect of different measures, we have taken the sum of the fluxes for each species in flux graphs and computed the squared distance on these sums. The differences between TOM, RI, CLI and 1.0 model for each of the \(p = 10^{-2}\) and \(p = 1.0\) values are listed in Table 4. The differences between \(p = 10^{-2}\) and \(p = 1.0\) for each of TOM, RI and CLI and 1.0 are listed in Table 5. We observe in Tables 4 and 5 that the inhibition constant does not play a significant role in distinguishing the effect of different measures as before with the exception of 1.0 network. This observation confirms that the rates provided by the measures plays a more significant role in determining the fluxes in comparison to the inhibition constant. However, in the control model, the inhibition constant plays a greater role in determining the system behaviour.
Table 4 indicates that CLI measure provides the greatest distinction from the control model and TOM provides the smallest distinction. The large distance between TOM and CLI and the one between RI and CLI as well as the much smaller distance between RI and TOM confirm this observation.
The different measures resulted in different numbers of pathways. With \(p= 0.01\), CLI has generated 1388 pathways, whereas RI has generated 998, TOM has generated 1340 and the control network has generated 1535. With \(p= 1.0\), CLI has generated 1285 pathways, whereas RI has generated 1210 pathways, TOM has generated 1251, and the control network has generated 1535 pathways. The resulting list of pathways for all the 8 cases can be downloaded together with all the scripts that are used to apply the methods above.Footnote 2
4 Discussion
We have proposed a method for enumerating dominant pathways in biological networks that can be represented as directed graphs consisting of activation and repression edges. Our analysis combines methods from static graph theoretical considerations in the literature with those for dynamical systems, based on simulations. Our method emphasises the inherent stochasticity in biological processes as well as the notion that cellular signals are relayed predominantly through highly connected nodes and pathways of least biochemical resistance.
The stochastic simulations in our examples result in individual simulation trajectories that expose the noise in the system. The notion of stochastic flux, delivered by these simulations, provides a direct quantification of information flow, for any time interval, including the transient states. However, averaging over many simulations as in the examples above dampens the stochastic noise. If a deterministic notion of information flow can be characterised, linear noise approximation simulations [2] or deterministic ODE simulations can be considered for the analysis of the systems where the stochastic noise is less of a concern.
As evidenced by our case study on E. coli network, the measures, TOM, RI, and CLI have a significant effect on determining the dominant pathways. In this regard, a more extensive evaluation of these measures as well as others in the literature is a topic of further investigation. Moreover, the ranking of the pathways is subject to parameters such as pathway length and flux strengths at various segments, which can change the ranking. An evaluation of these parameters in the context of biological evidence for the E. coli network and in applications to other large networks are topics of future work.
Notes
- 1.
All the data and scripts are available for download at: ozan-k.com/pathways.zip.
- 2.
References
Albert, R.: Scale-free networks in cell biology. J. Cell Sci. 118, 4947–4957 (2005)
Cardelli, L., Kwiatkowska, M., Laurenti, L.: Stochastic analysis of chemical reaction networks using linear noise approximation. Biosystems 149, 26–33 (2016)
Erhard, F., Friedel, C.C., Zimmer, R.: FERN: a Java framework for stochastic simulation and evaluation of reaction networks. BMC Bioinform. 9, 356 (2008)
Gama-Castro, S., et al.: RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res. 44(D1), D133–D143 (2016)
Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81(25), 2340–2361 (1977)
Gillespie, D.T.: Approximate accelerated stochastic simulation of chemically reacting systems. J. Chem. Phys. 115(4), 1716 (2001)
Kahramanoğulları, O.: On linear logic planning and concurrency. Inf. Comput. 207, 1229–1258 (2009)
Kahramanoğulları, O.: Quantifying information flow in chemical reaction networks. In: Figueiredo, D., Martín-Vide, C., Pratas, D., Vega-Rodríguez, M.A. (eds.) AlCoB 2017. LNCS, vol. 10252, pp. 155–166. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58163-7_11
Kahramanoğulları, O., Lynch, J.: Stochastic flux analysis of chemical reaction networks. BMC Syst. Biol. 7, 133 (2013)
Khatri, P., Sirota, M., Butte, A.J.: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8(2), e1002375 (2012)
Kuwahara, H., Mura, I.: An efficient and exact stochastic simulation method to analyze rare events in biochemical systems. J. Chem. Phys. 129(16), 10B619 (2008)
Ma, S., Jiang, T., Jiang, R.: Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics 31(4), 563–571 (2015)
Ma’ayan, A., et al.: Formation of regulatory patterns during signal propagation in a Mammalian cellular network. Science 309(5737), 1078–83 (2005)
Nielsen, M., Plotkin, G., Winskel, G.: Event structures and domains, part 1. Theor. Comput. Sci. 5(3), 223–256 (1981)
Persson, O.: Identifying research themes with weighted direct citation links. J. Informetr. 4(3), 415–422 (2010)
Planes, F.J., Beasley, J.E.: A critical examination of stoichiometric and path-finding approaches to metabolic pathways. Brief. Bioinform. 9(5), 422–436 (2008)
Randić, M.: Characterization of molecular branching. J. Am. Chem. Soc. 97(23), 6609–6615 (1975)
Ravasz, E., et al.: Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002)
Shannon, P., et al.: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13(11), 2498–504 (2003)
Zubarev, R.A., et al.: Identification of dominant signaling pathways from proteomics expression data. J. Proteomics 71(1), 89–96 (2008)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Kahramanoğulları, O. (2019). Enumerating Dominant Pathways in Biological Networks by Information Flow Analysis. In: Holmes, I., Martín-Vide, C., Vega-Rodríguez, M. (eds) Algorithms for Computational Biology. AlCoB 2019. Lecture Notes in Computer Science(), vol 11488. Springer, Cham. https://doi.org/10.1007/978-3-030-18174-1_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-18174-1_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18173-4
Online ISBN: 978-3-030-18174-1
eBook Packages: Computer ScienceComputer Science (R0)