Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

One of the fundamental problems of biology is to understand complex gene regulatory networks (GRNs), and various mathematical and statistical models have been introduced for inference from GRNs [1]. Based on such networks, over-represented biological processes or pathways of a group of genes are identified by mapping them onto the gene ontology (GO) terms or regulatory structures [2]. These pathway analyses provide the annotations and functional insight of the group of genes which are usually determined by conventional statistical tests such as the t-test. However, these differentially expressed gene (DEG) derived analyses are limited in detecting defective pathways since they only observe the amount of expression of a gene itself rather than considering the flows of expression signals that communicate with neighboring genes.

Here we aim to detect the abnormal pathways of GRNs by modelling them using G-Networks [3] which is a probabilistic model of a system with special agents such as positive and negative customers, signals and triggers. In contrast to normal queuing networks, the negative customers of G-Networks describe the inhibitory effects of GRNs [4, 5]. G-networks have a product form solution which enables us to handle the dynamics of complex GRNs without heavy computation times. The parameters of the modelled GRN are inferred from normal samples with the assumed transition probabilities of gene expression signals. Then the transition probabilities of abnormal conditioned samples are estimated by minimizing the difference between the observed and predicted steady-state probabilities with constraints. Finally, permutation tests are performed to determine the statistical significance of the estimated transition probabilities.

2 G-Networks for Gene Regulatory Networks

Following [4] consider the notion of a “packet” that contains the gene expression signals, and a network node that represents a gene consisting of a queue where its packets are stored and a server where the packets’ fates are determined. Let \(\lambda^{+}_{i}\) and \(\lambda^{-}_{i}\) be the positive and negative packet input rates to the ith node, respectively. \(\mu_{i}\) is the packet firing rate (service rate) of the ith node. Furthermore we define \({\mathbf x}=\{x_1,...,x_n\}\) a non-negative integer n-vector with \({\mathbf x}^{+}_{i}=\{x_1, ..., x_i+1, ..., x_n\},\;{{\mathbf x}}^{-}_{i}=\{x_1, ..., x_i-1, ..., x_n\},\) and \({{\mathbf x}}^{+-}_{ij}=\{x_1, ..., x_i+1, x_j-1, ..., x_n\}.\) Let \(p_{ij}^+\) and \(p_{ij}^-\) be the transition probabilities for packet motion from the ith node to the jth node as a positive and a negative packet, respectively. Note that a negative packet has the effect of disappearing after it destroys one packet of the target node, or it disappears also if it does not find a positive packet to destroy. Lastly, \(d_i\) denotes the probability that a packet leaves the system so that \(\sum_{j=1}^n (p_{ij}^{+} + p_{ij}^{-}) + d_{i}=1.\)

Consider now a random process \({{\mathbf X}}(t)=\{X_1(t),...,X_n(t)\}\) where \(X_i(t)\) is an integer-valued random variable representing the number of packets in the ith node at time \(t \geq 0.\) If Pr(x,t) is the probability that X(t) takes the value x at time t, then the G-network equations are:

$$ \begin{aligned} Pr({\mathbf x},t+\Updelta t) & = \sum^n_{i=1} \bigg[ (\lambda^{+}_{i} \Updelta t + o(\Updelta t)) Pr({\mathbf x}^-_i,t)I({\mathbf x}_i>0) + (\lambda^{-}_{i} \Updelta t + o(\Updelta t)) Pr({\mathbf x}^+_i, t) \\ & \quad+ \sum^n_{j=1}\Big\{(p^+_{ij}\mu_i \Updelta t + o(\Updelta)) Pr({\mathbf x}^{+-}_{ij}, t) I({\mathbf x}_j>0) \\ & \quad+ (p^-_{ij} \mu_i \Updelta t + o(\Updelta))Pr({\mathbf x}^{++}_{ij}, t) + (p^-_{ij} \mu_i \Updelta t + o(\Updelta))Pr({\mathbf x}^{+}_{i}, t) I({\mathbf x}_j=0) \\ & \quad + \sum^n_{l=1} \big( (p_{ijl} \mu_i \Updelta t + o(\Updelta t)) Pr({\mathbf x}^{++-}_{ijl},t) + (p_{jil} \mu_j \Updelta t + o(\Updelta t)) Pr({\mathbf x}^{++-}_{ijl}, t) \big) I({\mathbf x}_l > 0) \Big \} \\ & \quad+ (d_i \mu_i \Updelta t + o(\Updelta t)) Pr({\mathbf x}^+_i, t) + (1- (\lambda^{+}_i + \lambda^{-}_i + \mu_i) \Updelta t + o(\Updelta t)) Pr({\mathbf x},t) \bigg] \\ \end{aligned} $$
(1)

where I(C) is 1 if C is true and 0 otherwise, and \(o(\Updelta t) \rightarrow 0\) as \(\Updelta t \rightarrow 0.\) The complete equilibrium solution of (1) was given in [4]. Let \(q_i\) be the steady-state probability that the ith gene is activated:

$$ q_i= \hbox{min} \left[1, {\frac{\lambda_{i}^+ + \Uplambda_i^+}{\mu_i + \lambda_{i}^- + \Uplambda_i^-}} \right] $$
(2)

with

$$ \Uplambda_i^+ = \sum^n_{j=1} q_j \mu_j p^+_{ji} + \sum^n_{j, l=1, l \neq j} q_j q_l \mu_j p_{jli}\; \hbox{and}\; \Uplambda_i^- = \sum^n_{j=1} q_j \mu_j p^-_{ji} + \sum^n_{j, l=1, l \neq j} q_l \mu_l p_{lij} $$

then the steady-state probability that there are \(x_i\) packets of ith node in each of the n cells is:

$$ \lim_{t \rightarrow \infty} Pr (X_1=x_1,...,X_i=x_i,...,X_n; t) = \Uppi_{i=1}^nq_i^{x_i}( 1- q_i) $$
(3)

3 Abnormal Edge Detection

The packets in the G-network represent latent objects containing the gene expression signal, and we assume that the number of packets is proportional to the mRNA expression levels which are actually observable data. We also assume that the mRNA levels are observations of the steady-state. Therefore the steady-state probability that there is at least one mRNA of ith gene is \(q_i={\frac{a_i}{a_i + 1}}\) from (3) if we denote by \(a_i\) the average mRNA level (average queue length) of ith gene, also given by \(a_i=q_i/(1-q_i).\)

To determine the G-network parameters under normal conditions we use (2) where there are four sets of unknown parameters \(p_{ji} = \{p_{ji}^+, p_{ji}^-, q_{jli}, q_{lij} \},\;\lambda_i^{+}, \;\lambda_i^{-},\) and \(\mu_i.\) We initially set \(p_{ji}=1/(n_j^{out}+1)\) where \(n_j^{out}\) is the out-degree of gene j. We set the packet output rate \(\mu_i\) based on the values of \(\lambda_i^+\) and \(\lambda_i^-\) which are \(\lambda_i^+ = 0.0062\,\text{sec}^{-1}\) and \(\lambda_i^- = 0.002\,\text{sec}^{-1}\) [6], with \(\mu_i=c \cdot n_i^{out}\) where c is a scaling constant. From (2) we have \(q_i=f_i(\lambda^+_i, \lambda^-_i, \mu_i |{\mathbf q}_j, p_{ji})\) where \({\mathbf q}=(q_1,..,q_n).\) Then c can be found by minimizing the following equation given the initial values of \(\lambda^+_i\) and \(\lambda^-_i;\)

$$ \tilde{c} = \hbox{arg}\; \hbox{min}_{c} \sum_i (q_i-f_i( c | {\mathbf q}, p_{ji}, \lambda^+_i, \lambda^-_i))^2 $$
(4)

Once each \(\mu_i\) is determined, we can find the optimal positive input rate \(\lambda^+_i\) which minimizes \((q_i-f_i( \lambda^+_i | {\mathbf q}, p_{ji}, \tilde{\mu}_i, \lambda^-_i))^2\) for each gene with the initial value \(\lambda^-_i\) and a constraint \(0 \le \tilde{\lambda}^+_i \le \mu_i + \lambda^-_i + \Uplambda^-_i-\Uplambda^+_i.\) Then we determine \(\tilde{\lambda}^-_i\) which produces exactly the same values of \(q_i.\)

3.1 Transition Probabilities in Abnormal Conditions

In an abnormal condition, let \(q^{\prime}_i\) be the steady-state probability that ith gene is activated and \(p^{\prime}_{ji}\) be a packet transition probability from the ith gene to jth gene in the same condition. If there are k unknown \(p^{\prime}_{ji}\) for ith gene, then we will denote them by a vector \({{\mathbf p}}_{ki}.\) For the detection of the abnormally behaving pathways, \({{\mathbf p}}_{ki}\) needs to be estimated given the input (\(\tilde{\lambda}_i^+\) and \(\tilde{\lambda}_i^-\)) and output (\(\tilde{\mu}_i\)) rates found in normal conditions. \({{\mathbf p}}_{ki}\) can be determined by minimizing the following sqaured error with two constraints, \(0 \le \tilde{p}^{\prime}_{ji}\) and \(\sum_{i}{\tilde{p}^{\prime}_{ji}} \le 1;\)

$$ \tilde{{\mathbf p}}^{\prime}_{ki} = \arg\min_{{\mathbf p}^{(h)}_{ki}} (q^{\prime}_i-f_i({\mathbf p}^{(h)}_{ki} | {\mathbf q}^{\prime}_j, \tilde{\lambda}^+_i, \tilde{\lambda}^-_i , \tilde{\mu}_i))^2 $$
(5)

where \({\mathbf p}^{(h)}_k\) is the hth hypothesis in the constrained parameter space. Our algorithm searches for the optimal solution iteratively with different initial starting values to reduce the possibility of remaining in a local minimum.

3.2 Permutation Test for the Estimated Transition Probabilities

When the estimated \(\tilde{p}^{\prime}_{ij}\) differs from its initially assumed value \(p_{ij},\) it is necessary to determine if the difference is statistically significant. The null hypothesis of this test will be \(\tilde{p}^{\prime}_{ij} = p_{ij}.\) To proceed with the test the set of samples is shuffled at random and divided into normal and abnormal groups with the same sample size of the original group. Then the proposed method is applied in the same way as the original data. Let M be the number of permutations and \(\tilde{p}^{(m)}_{ij}\) be the estimated transition probability of the mth permutation. Then we can compute the emperical p-value of the \(\tilde{p}^{\prime}_{ij}\) as follows,

$$ p{\text-}\hbox{value\;of}\;\tilde{p}^{\prime}_{ij}=\left\{ \begin{array}{ll} {\frac{1}{M}}\sum_{m=1}^M I(p^{\prime}_{ij} \leq \tilde{p}^{(m)}_{ij}) & \text {if } p^{\prime}_{ij} > p_{ij}\\ {\frac{1}{M}}\sum_{m=1}^M I(p^{\prime}_{ij} > \tilde{p}^{(m)}_{ij}) & \text {if} p^{\prime}_{ij} \leq p_{ij} \end{array} \right. $$

where I(C) is the indicator function. Thus if the p-value is less than \(\alpha_2\) then the null hypothesis is rejected. In our study, \(\alpha_2 = 0.1.\)

4 The p53 Pathway

In order to evaluate our approach using experimental data, we selected the p53 pathway which is a well studied system in human cells whose most important feature is tumor suppression when DNA is damaged. The regulatory structure of the p53 pathway with 30 genes was constructed on the basis of the KEGG database, and we also downloaded two microarray mRNA expression datasets from GEO. The first dataset (GSE12941) consists of 10 non-tumor liver tissue and 10 hepatocellular carcinoma (HCC) samples. The second (GSE6222) is a dataset for the study of liver cancer progression in HCC. In this dataset, we use 2 normal and 10 HCC samples. Before applying the proposed method, the data was normalized and scaled with mean 3 so that the average number of mRNAs of a gene without its interactions in a single cell are assumed to be approximately 3, while the variance was scaled to 1. The gene input and output rates are assumed to be \(0.0062\,{\text sec}^{-1}\) and \(0.002\,{\text sec}^{-1}\) from [6] so that \(0.0062/0.002 \approx 3.\)

Figure 1 shows the average expression levels of genes in each dataset, and the corresponding p-values of t-tests which detect DEGs. The two datasets share nine significant DEGs with 0.05 significance level while GSE12941 has four more DEGs. This similarity can be confirmed by observing their expression patterns in Fig. 1.However interpreting the DEGs even when we know their regulatory structure is yet another challenge. Figure 2 shows the results from our proposed method. Despite the apparent lack of significance of p53 and MDM2 in the t-test, the p53-MDM feedback loop was clearly activated in cancer samples in both datasets. In [7], the p53-MDM2 feedback loop appears to produce oscillatory expression patterns. Thus in terms of the system dynamics, the activation of two pathways between p53 and MDM2 in our result might be more appropriate than the activation of only one pathway from MDM2 to p53. One of the significant pathways in both datasets is TP53-IGFBP3-IGF1 [8]. Also our method properly detects two pathways, ATM-CHEK2-TP53 and ATR-CHEK1-TP53 as expected from [9] in both datasets, which cannot be detected merely by observing the p-values of the DEG test.

Fig. 1
figure 1

Average mRNA expression levels of genes in two datasets, GSE12941 (top) and GSE6222 (middle). The p-values of the t-tests are shown in the bottom panel where the y-axis represents negative natural logarithms of the p-values. Note that multiple testing corrections are not applied in these i-tests

Fig. 2
figure 2

The p53 network with the results of (a) GSE12941 and (b) GSE6222 dataset analysis. The solid line represents significantly activated (black) or inactivated (grey) pathways while the dashed line indicates non-significant pathways. Wider lines represent more significant pathways. The grey nodes are the selected DEGs from the t-test. The radius of a node is larger if its DEG is more significant with a 0.05 significant level. White nodes indicate non-significant genes

5 Discussion

We have proposed a new approach for detecting abnormal pathways in GRNs based on G-network modelling. This method provides an effective way to describe the flows of gene expression signals including negative or inhibitory effects on gene expression. Using some experimental data, we show that one advantage of our approach is that it can detect abnormal information flows in the dynamics of gene pathways. Thanks to existing G-network theory, the model uses a computationally tractable steady-state analysis and therefore does not require a large number of samples from time-dependent data. Moreover the analytic solution provided by G-network theory offers the possibility that our approach may be extended to very large-scale GRN systems.

In order to exploit this analytical tool, our work shows that a successful application of this method requires that the model be started with a reliable prior network structure based on real experimental data, or carefully calibrated GRN information. Though our intial experimental evaluation appears quite positive, further experimental studies will be needed to validate the proposed approach and apply it to attain biological meaningful and clinically useful results.