
1 Introduction

The reverse engineering of Gene Regulatory Networks (GRN) is an important problem studied in the fields of Systems Biology and Bioinformatics. Solving it can help us understand the regulatory mechanisms of a biological phenomenon under study. For instance, researchers have studied GRN in the context of the cell cycle [15], the circadian cycle of plants [18], and, more importantly, diseases such as cancer [9, 19]. Given the importance of this problem, several models of GRN, along with algorithms for their inference, have been proposed over the years. Generally, these algorithms receive a gene expression dataset as input and output a regulatory network represented by a specific model. The input data are provided by technologies such as DNA microarrays [22] and, more recently, RNA-Seq [27].

Among the proposed mathematical models of GRN are Boolean networks [12], Bayesian networks [4], and models based on ordinary differential equations [6], to name a few. For a review of GRN models we suggest the papers [7, 11, 17]. Algorithms for the reverse engineering of GRN are based on mathematical models, and each has its own characteristics as well as advantages and disadvantages. We cannot say that one algorithm outperforms the others in every situation. For instance, some algorithms are designed to work with time-series data, while others work with steady-state data. It is worth mentioning that the reverse engineering of GRN is an ill-posed problem, i.e., given a set of gene expression data, there may be several solutions (networks) consistent with these data. This fact makes the problem a very difficult one.

In this work, we modeled gene expression in a Boolean fashion, where a gene can take on two possible values: 0 (not expressed) and 1 (expressed). We also used a probabilistic graphical model known as the Dynamic Bayesian Network, which is commonly used to represent complex stochastic processes. It differs from the standard Bayesian Network in that it models the stochastic evolution of a set of variables over time. Both models are defined by a graphical structure and a set of parameters, which together specify a joint distribution over the random variables. Algorithms for learning the parameters and the structure of such networks have been developed [14, 21]. In this paper, to learn the structure of Dynamic Bayesian Networks from time-series gene expression data, we search for the network structure that best matches the data using the Bayesian Information Criterion (BIC) score [23] and the BDe score [8].

Most reverse engineering algorithms for gene regulatory networks do not take into account any source of prior biological knowledge. Using additional biological information in this problem has shown interesting results, as can be seen in [5, 26, 29]. The source of prior knowledge used in this work is a database called STRING [20, 25], which covers more than 2000 organisms in its latest version (10.0). In [13] a similar approach was taken using the Boolean network model and a feature selection algorithm. Here, we use a different model and inference approach, but the same source of prior biological knowledge. Because we are facing an ill-posed problem, this approach aims to improve the inference of gene interactions, since it takes into account interactions already supported by the literature.

To validate the methodology we used data from the DREAM challenge [2]. We also applied the methodology in a dataset of yeast (Saccharomyces cerevisiae) cell-cycle gene expression [24] and compared the results against the literature through the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [10].

In the next section, we present the methodology, beginning with the introduction of the Dynamic Bayesian Networks, followed by the definitions of the score functions used in the learning process. We also explain how the prior biological knowledge database was used in the process. In Sect. 3 we discuss the results, followed by our conclusions, in Sect. 4.

2 Methodology

In this section we present the mathematical model of GRN used in our methodology along with the methodology to learn the model from the data. We also present the source of prior biological knowledge used and how it is used in the process of reverse engineering a GRN. In the final part of the section we discuss how the methodology is validated.

2.1 Dynamic Bayesian Networks

A Dynamic Bayesian Network (DBN) is a probabilistic graphical model defined by a network structure and a set of parameters to specify the joint probability of random variables. Let \(\mathcal {X} = \{X_1,X_2, \ldots , X_n\}\) be a set of random variables where each variable \(X_i\) can assume values in the finite set \(\mathrm {Val}(X_i)\). In DBN, unlike standard Bayesian networks, the time is taken into account such that \(X_i[t]\) is the value of \(X_i\) at time t. Thus, \(\mathcal {X}[t]\) is the set of all random variables at time t.

To represent the beliefs about the process, a probability distribution over the variables \(\mathcal {X}[0] \cup \mathcal {X}[1] \cup \mathcal {X}[2] \cup \cdots \) is necessary. We assume that the process is Markovian in the sense that \(\mathrm {P}(\mathcal {X}[t+1]\,|\,\mathcal {X}[0],\ldots ,\mathcal {X}[t]) = \mathrm {P}(\mathcal {X}[t+1]\,|\,\mathcal {X}[t])\). The probability \(\mathrm {P}(\mathcal {X}[t+1]\,|\,\mathcal {X}[t])\) is time independent.

A DBN representing the joint distribution over all possible trajectories of a process is composed of: (i) a prior network \(B_0\) specifying a distribution over \(\mathcal {X}[0]\); (ii) a transition network \(B_{\rightarrow }\) over \(\mathcal {X}[0] \cup \mathcal {X}[1]\) that is taken to specify \(\mathrm {P}(\mathcal {X}[t+1]\,|\,\mathcal {X}[t])\) for all t. In Fig. 1(a) and (b) we can see a simple example.

The transition probability of this network is given by

$$\begin{aligned} \mathrm {P}_{B_{\rightarrow }}(\mathcal {X}[1]\,|\,\mathcal {X}[0]) = \prod ^{n}_{i=1} \mathrm {P}_{B_{\rightarrow }}(X_i[1]\,|\,\varvec{\mathrm {Pa}}(X_i[1])), \end{aligned}$$
(1)

where \(\varvec{\mathrm {Pa}}(X_i)\) denotes the set of parents of \(X_i\). Thus, the structure of a DBN is defined by a pair \((B_0, B_{\rightarrow })\) corresponding to a network over the variables \(\mathcal {X}[0], \ldots , \mathcal {X}[\infty ]\). In practice, we reason only about the finite interval \(0, \ldots , T\). To this end, we “unroll” the DBN structure into a Bayesian Network over \(\mathcal {X}[0], \ldots , \mathcal {X}[T]\). In slice 0, the parents of \(X_i[0]\) are those specified in \(B_0\). In slice \(t+1\), the parents of \(X_i[t+1]\) are nodes in slices t and \(t+1\) corresponding to the parents of \(X_i[1]\) in \(B_{\rightarrow }\). In Fig. 1(c) we can see the result of unrolling the network for 3 time slices. Given a DBN, the joint distribution over \(\mathcal {X}[0], \ldots , \mathcal {X}[T]\) is

$$\begin{aligned} \mathrm {P}(\mathcal {X}[0], \ldots , \mathcal {X}[T]) = \mathrm {P}_{B_0}(\mathcal {X}[0]) \prod _{t=0}^{T-1}\mathrm {P}_{B_{\rightarrow }}(\mathcal {X}[t+1]\,|\,\mathcal {X}[t]). \end{aligned}$$
(2)
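To make Eq. 2 concrete, the following sketch computes the probability of a Boolean trajectory under a tiny two-gene DBN and checks that probabilities over all trajectories of a fixed length sum to one. The network, gene count, and all probability values are illustrative assumptions of this example, not taken from the paper's experiments.

```python
from itertools import product

# Prior network B0: P(X1[0]) and P(X2[0] | X1[0])  (hypothetical values)
p_x1_0 = {0: 0.6, 1: 0.4}
p_x2_0 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x2_0[x1][x2]

# Transition network B->: P(Xi[t+1] | X1[t], X2[t]) for each gene
p_x1_next = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
             (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}
p_x2_next = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.3, 1: 0.7},
             (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.2, 1: 0.8}}

def trajectory_probability(traj):
    """P(X[0..T]) = P_B0(X[0]) * prod_t P_B->(X[t+1] | X[t])  (Eq. 2)."""
    x1, x2 = traj[0]
    p = p_x1_0[x1] * p_x2_0[x1][x2]
    for prev, cur in zip(traj, traj[1:]):
        p *= p_x1_next[prev][cur[0]] * p_x2_next[prev][cur[1]]
    return p

# Sanity check: the distribution over all length-3 trajectories sums to 1.
total = sum(trajectory_probability([s0, s1, s2])
            for s0 in product((0, 1), repeat=2)
            for s1 in product((0, 1), repeat=2)
            for s2 in product((0, 1), repeat=2))
```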

2.2 Learning DBN from Complete Data

In this work, the reverse engineering of gene networks is treated as a problem of learning a DBN from complete data. The data in question is a time-series gene expression dataset \(\mathcal {D}\) consisting of K complete observation sequences \(D_1,D_2, \ldots , D_K\). Each sequence \(D_{\ell }\) has length \(N_{\ell }\) and specifies values for the variables \(\mathbf {x}^{\ell }[0], \mathbf {x}^{\ell }[1], \ldots , \mathbf {x}^{\ell }[N_{\ell }]\). Thus, we have K instances of initial slices to learn \(B_0\) and \(N = \sum _{\ell } N_{\ell }\) instances of transitions to learn \(B_{\rightarrow }\).

Fig. 1. (a) A prior network \(B_0\) and (b) a transition network \(B_{\rightarrow }\) for three variables. In (c) is the corresponding “unrolled” network. Figure adapted from [4].

Formally, the reverse engineering problem is stated as follows. Given a dataset \(\mathcal {D}\) of instances of \(\mathcal {X}\), find a DBN \(\mathcal {B}\) with structure \(\mathcal {G}=(B_0, B_{\rightarrow })\) and its parameters \(\mathbf {\Theta }\) that best matches \(\mathcal {D}\). We use a scoring function to measure how a DBN matches the data. The score combines the likelihood of the data according to the network, \(L(\mathcal {B}:\mathcal {D}) = \log \mathrm {P}(\mathcal {D}\,|\,\mathcal {B})\), with some penalty according to the complexity of the network.

BIC Score. We define the Bayesian Information Criterion score and the BDe score according to [3]. First, let us define

$$\begin{aligned} \theta ^{(0)}_{i,j^{'}_i,k^{'}_i} = P(X_i[0] = k^{'}_i\,|\,\varvec{\mathrm {Pa}}(X_i[0])=j^{'}_i) \end{aligned}$$
(3)

and similarly

$$\begin{aligned} \theta ^{\rightarrow }_{i,j_i,k_i} = P(X_i[t] = k_i\,|\,\varvec{\mathrm {Pa}}(X_i[t])=j_i) \end{aligned}$$
(4)

for \(t=1,\ldots ,T\). The notation for the sufficient statistics is given by

$$\begin{aligned} N^{(0)}_{i,j^{'}_i,k^{'}_i} = \sum _{\ell } I(X_i[0] = k^{'}_i, \varvec{\mathrm {Pa}}(X_i[0])=j^{'}_i ; \mathbf {x^{\ell }}) \end{aligned}$$
(5)

and

$$\begin{aligned} N^{\rightarrow }_{i,j_i,k_i} = \sum _{\ell } \sum _t I(X_i[t] = k_i, \varvec{\mathrm {Pa}}(X_i[t])=j_i ; \mathbf {x^{\ell }}) \end{aligned}$$
(6)

where \(I(\cdot ;\mathbf {x^{\ell }})\) is an indicator function which takes on value 1 if the event \(\cdot \) occurs in \(\mathbf {x}^{\ell }\), and 0 otherwise.
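The sufficient statistics in Eqs. 5 and 6 are simple counts over the data. A minimal sketch for Boolean genes, under the simplifying assumptions that all parents lie in the previous time slice and that \(B_0\) has an empty parent set (the function name and data layout are our own):

```python
from collections import Counter

def sufficient_statistics(sequences, parents):
    """Compute the counts of Eqs. 5 and 6 for Boolean genes.

    sequences : K observation sequences, each a list of tuples of
                0/1 values (one tuple per time point)
    parents   : parents[i] is the tuple of gene indices forming
                Pa(Xi) in the previous slice
    Returns (n0, nt): n0[(i, pa, v)] counts over initial slices,
    nt over transitions, pooled across sequences and time points.
    """
    n0, nt = Counter(), Counter()
    for seq in sequences:
        # initial slice: empty parent set assumed for B0 here
        for i, v in enumerate(seq[0]):
            n0[(i, (), v)] += 1
        # every consecutive pair of slices is one transition instance
        for prev, cur in zip(seq, seq[1:]):
            for i, v in enumerate(cur):
                pa = tuple(prev[j] for j in parents[i])
                nt[(i, pa, v)] += 1
    return n0, nt
```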

Using Eq. 2 and rearranging the terms, the likelihood function is decomposed according to the DBN structure:

$$\begin{aligned} \mathrm {P}(\mathcal {D}\,|\,\mathcal {G},\mathbf {\Theta }_{\mathcal {G}}) = \prod _i \prod _{j^{'}_i} \prod _{k^{'}_i} (\theta ^{(0)}_{i,j^{'}_i,k^{'}_i})^{N^{(0)}_{i,j^{'}_i,k^{'}_i}} \times \prod _i \prod _{j_i} \prod _{k_i} (\theta ^{\rightarrow }_{i,j_i,k_i} )^{N^{\rightarrow }_{i,j_i,k_i}} \end{aligned}$$
(7)

and the log-likelihood is given by

$$\begin{aligned} L(\mathcal {B}:\mathcal {D}) = \sum _i \sum _{j^{'}_i} \sum _{k^{'}_i} N^{(0)}_{i,j^{'}_i,k^{'}_i} \log \theta ^{(0)}_{i,j^{'}_i,k^{'}_i} + \sum _i \sum _{j_i} \sum _{k_i} N^{\rightarrow }_{i,j_i,k_i} \log \theta ^{\rightarrow }_{i,j_i,k_i} . \end{aligned}$$
(8)

This decomposition facilitates the computation of the BIC and BDe scores.

Using the standard maximum likelihood estimate for multinomial distributions, we get the following expression for \(\hat{\mathbf {\Theta }}_{\mathcal {G}}\):

$$\begin{aligned} \hat{\theta }^{(0)}_{i,j^{'}_i,k^{'}_i} = \frac{N^{(0)}_{i,j^{'}_i,k^{'}_i}}{\sum _{k^{'}_i}N^{(0)}_{i,j^{'}_i,k^{'}_i}} , \end{aligned}$$
(9)

and similarly for the transition case. Thus, the BIC score is given by

$$\begin{aligned} \mathrm {BIC}(\mathcal {G}:\mathcal {D}) = \mathrm {BIC}_0 + \mathrm {BIC}_{\rightarrow }, \end{aligned}$$
(10)

where

$$\begin{aligned} \mathrm {BIC}_0 = \sum _i \sum _{j^{'}_i} \sum _{k^{'}_i} N^{(0)}_{i,j^{'}_i,k^{'}_i} \log \hat{\theta }^{(0)}_{i,j^{'}_i,k^{'}_i} - \frac{\log K}{2}\#G_0 \end{aligned}$$
(11)

and

$$\begin{aligned} \mathrm {BIC}_{\rightarrow } = \sum _i \sum _{j_i} \sum _{k_i} N^{\rightarrow }_{i,j_i,k_i} \log \hat{\theta }^{\rightarrow }_{i,j_i,k_i} - \frac{\log N}{2}\#G_{\rightarrow }, \end{aligned}$$
(12)

where \(\#G_0\) and \(\#G_{\rightarrow }\) are the number of parameters in \(B_0\) and \(B_{\rightarrow }\), respectively.
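Given the sufficient statistics, each BIC term in Eqs. 11 and 12 decomposes into a sum over (variable, parent-configuration) families. A hedged sketch, assuming Boolean variables so that each family contributes one free parameter (the function name and count layout are assumptions of this illustration):

```python
import math

def bic_component(counts, num_instances, arity=2):
    """One of the two BIC terms in Eqs. 11-12: the maximized
    log-likelihood minus (log M / 2) * #parameters, where M is K for
    the initial-slice term and N for the transition term. `counts`
    maps (variable, parent_config, value) to sufficient statistics."""
    # group counts by (variable, parent configuration)
    families = {}
    for (i, pa, v), n in counts.items():
        families.setdefault((i, pa), {})[v] = n

    loglik = 0.0
    for vals in families.values():
        total = sum(vals.values())
        for n in vals.values():
            if n > 0:
                # maximum likelihood estimate: theta_hat = n / total (Eq. 9)
                loglik += n * math.log(n / total)
    # arity - 1 free parameters per family (probabilities sum to 1)
    n_params = (arity - 1) * len(families)
    return loglik - (math.log(num_instances) / 2) * n_params
```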

BDe Score. The BDe score is computed by evaluating the following integral:

$$\begin{aligned} \mathrm {P}(\mathcal {D}\,|\,\mathcal {G}) = \int \mathrm {P}(\mathcal {D}\,|\,\mathcal {G},\mathbf {\Theta }_{\mathcal {G}})\mathrm {P}(\mathbf {\Theta }_{\mathcal {G}}\,|\,\mathcal {G})\,d\mathbf {\Theta }_{\mathcal {G}}. \end{aligned}$$
(13)

Using Eq. 7 and assuming Dirichlet priors we can obtain a closed form. A Dirichlet prior for a variable X is specified by a set of hyperparameters \(\{N'_{x}:x \in \mathrm {Val}(X)\}\):

$$\begin{aligned} \mathrm {P}(\mathbf {\Theta }_{X}) = \mathrm {Dirichlet}({N'_{x}:x \in \mathrm {Val}(X)}) \propto \prod _x \theta ^{N'_{x}-1}_x . \end{aligned}$$
(14)

The hyperparameters can be interpreted as pseudo counts. Assuming that for each structure \(\mathcal {G}\) we have hyperparameters \(N'^{(0)}_{i,j^{'}_i,k^{'}_i}\) and \(N'^{\rightarrow }_{i,j_i,k_i}\) we can rewrite \(\mathrm {P}(\mathcal {D}\,|\,\mathcal {G})\) as a product of two terms

$$\begin{aligned} \prod _i \prod _{j^{'}_i} \frac{\varGamma (\sum _{k^{'}_i} N'^{(0)}_{i,j^{'}_i,k^{'}_i})}{\varGamma (\sum _{k^{'}_i} (N'^{(0)}_{i,j^{'}_i,k^{'}_i} + N^{(0)}_{i,j^{'}_i,k^{'}_i}))} \times \prod _{k^{'}_i} \frac{\varGamma (N'^{(0)}_{i,j^{'}_i,k^{'}_i} + N^{(0)}_{i,j^{'}_i,k^{'}_i})}{\varGamma (N'^{(0)}_{i,j^{'}_i,k^{'}_i})}, \end{aligned}$$
(15)

and similarly for the transition case, where \(\varGamma (x) = \int ^{\infty }_{0} t^{x-1}e^{-t}dt\) is the Gamma function, satisfying \(\varGamma (1) = 1\) and \(\varGamma (x+1)=x\varGamma (x)\). Following [8], we assign the Dirichlet hyperparameters as

$$\begin{aligned} N'^{(0)}_{i,j^{'}_i,k^{'}_i} = \hat{N}^{(0)} \times \mathrm {P}_{B'_0}(X_i[0] = k^{'}_i | \varvec{\mathrm {Pa}}(X_i[0]) = j^{'}_i)\;\;\mathrm {and} \end{aligned}$$
(16)
$$\begin{aligned} N'^{\rightarrow }_{i,j_i,k_i} = \hat{N}^{\rightarrow } \times \mathrm {P}_{B'_{\rightarrow }}(X_i[1] = k_i | \varvec{\mathrm {Pa}}(X_i[1]) = j_i), \end{aligned}$$
(17)

given two equivalent sample sizes \(\hat{N}^{(0)}\) and \(\hat{N}^{\rightarrow }\).
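For one (variable, parent-configuration) family, the corresponding factor of Eq. 15 can be evaluated in log space with the log-Gamma function to avoid overflow. A sketch under the same Dirichlet assumptions (the function name and argument layout are illustrative):

```python
import math

def log_bde_family(pseudo, counts):
    """Log contribution of one (variable, parent-config) family to
    P(D | G), i.e. one factor of Eq. 15 evaluated via log-Gamma.
    `pseudo[k]` are the Dirichlet hyperparameters N' and `counts[k]`
    the sufficient statistics N, over the values k of the variable."""
    s_prior = sum(pseudo.values())
    s_posterior = s_prior + sum(counts.values())
    # Gamma(sum N') / Gamma(sum N' + sum N), in log space
    score = math.lgamma(s_prior) - math.lgamma(s_posterior)
    for k, n_prime in pseudo.items():
        # Gamma(N'_k + N_k) / Gamma(N'_k), in log space
        score += math.lgamma(n_prime + counts.get(k, 0)) - math.lgamma(n_prime)
    return score
```

As a sanity check, with uniform hyperparameters \(N'=1\) for a Boolean variable and two observations of the same value, the result equals \(\log \int \theta ^2 d\theta = \log (1/3)\).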

2.3 Prior Biological Knowledge

The source of prior biological knowledge used is the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins) [25]. This database aims to provide a critical assessment and integration of protein-protein interactions, including direct (physical) as well as indirect (functional) associations. The current version of STRING is v10 and it covers more than 2000 organisms. In addition, an API for the R computing environment is available, allowing us to easily retrieve information about gene interactions.

The interactions are derived from multiple sources: (i) known experimental interactions are imported from primary databases, (ii) pathway knowledge is parsed from manually curated databases, (iii) automated text-mining is applied to uncover statistical and/or semantic links between proteins, based on Medline abstracts and a large collection of full-text articles, (iv) interactions are predicted de novo by a number of algorithms using genomic information as well as by co-expression analysis and (v) interactions that are observed in one organism are systematically transferred to other organisms, via pre-computed orthology relations [25].

For each pair of genes of a given organism, the STRING database provides a set of scores based on association evidence type: conserved neighborhood, gene fusions, phylogenetic co-occurrence, co-expression, database imports, large-scale experiments and literature co-occurrence. There is also a combined score expressing confidence when an association is supported by several types of evidence. For details see reference [20].

Let \(s(X_i,X_j)\) be the combined score for a pair of genes obtained from STRING. First, we normalize the scores so that \(0 \le s(X_i,X_j) \le 1\). Given that, the biological score is defined as the mean

$$\begin{aligned} \beta (X_i, \varvec{\mathrm {Pa}}(X_i)) = \frac{1}{|\varvec{\mathrm {Pa}}(X_i)|} \sum _{X_j \in \varvec{\mathrm {Pa}}(X_i)} s(X_i,X_j) . \end{aligned}$$
(18)

The decomposition in Eq. 8 facilitates the computation of the BIC and BDe scores. Notice that the log-likelihood is expressed as a sum of terms, where each term depends only on the conditional probability of a variable \(X_i\) given its parents \(\varvec{\mathrm {Pa}}(X_i)\). Thus, when computing the BIC or BDe score, we add the term \(\omega \cdot \beta (X_i, \varvec{\mathrm {Pa}}(X_i))\), where \(\omega \in \mathbb {R}\) is the weight given to the prior biological knowledge. Setting \(\omega =0\) means that the biological knowledge is not taken into account.
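The biological score of Eq. 18 and its weighted addition to a family's score can be sketched as follows. The normalization assumes STRING's combined scores are reported on the 0-1000 scale (the paper only states that scores are normalized to [0, 1]); the gene names and score values are illustrative:

```python
def biological_score(gene, parents, string_scores):
    """Mean normalized STRING combined score between `gene` and its
    candidate parents (Eq. 18). `string_scores` maps unordered gene
    pairs to combined scores, assumed here on the 0-1000 scale;
    absent pairs get score 0."""
    if not parents:
        return 0.0
    total = sum(string_scores.get(frozenset((gene, p)), 0) / 1000.0
                for p in parents)
    return total / len(parents)

def augmented_score(base_score, gene, parents, string_scores, omega=1.0):
    """Family score with the prior-knowledge term omega * beta added."""
    return base_score + omega * biological_score(gene, parents, string_scores)

# Illustrative gene names and combined scores (hypothetical values):
scores = {frozenset(("CLN3", "CDC28")): 800,
          frozenset(("CLN3", "CLB2")): 400}
beta = biological_score("CLN3", ["CDC28", "CLB2"], scores)  # mean of 0.8, 0.4
```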

2.4 Validation

To validate and analyze the methodology, assume that \(\mathcal {G}^{*}\) is the gold standard network and \(\mathcal {G}\) is the output network. Then, it is possible to fill the entries of a confusion matrix [28] as shown in Table 1. For instance, if an edge/connection is present in the gold standard and is correctly inferred, it counts as a true positive (TP).

Table 1. Confusion matrix. TP = true positive; FP = false positive; FN = false negative; TN = true negative.

To quantify the quality of the output network, we used a similarity measure [1] defined as

$$\begin{aligned} \mathrm {Similarity}(\mathcal {G}^{*}, \mathcal {G}) = \sqrt{\mathrm {PPV} \times \mathrm {Specificity}}, \end{aligned}$$
(19)

where

$$\begin{aligned} \mathrm {PPV} = \frac{\mathrm {TP}}{\mathrm {TP} + \mathrm {FP}} \text{ and } \mathrm {Specificity} = \frac{\mathrm {TN}}{\mathrm {TN} + \mathrm {FP}}. \end{aligned}$$
(20)
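The similarity measure of Eqs. 19 and 20 follows directly from the confusion matrix entries; a minimal sketch (the function name is ours):

```python
import math

def similarity(tp, fp, fn, tn):
    """Geometric mean of PPV and specificity (Eqs. 19-20).
    `fn` is accepted for a complete confusion matrix, although this
    particular measure does not use it."""
    ppv = tp / (tp + fp)           # positive predictive value
    specificity = tn / (tn + fp)
    return math.sqrt(ppv * specificity)
```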

2.5 Implementation

The methodology was implemented by extending an open source package named Pgmpy (http://pgmpy.org), which is a Python library for working with probabilistic graphical models. This is an ongoing project, and for DBN only parameter inference had been implemented. We extended the library by implementing structure learning for DBN given a gene expression dataset and the source of prior biological knowledge provided by the STRING database.

3 Results and Discussion

In order to apply the proposed methodology, we performed experiments using data from the DREAM4 Challenge (Dialogue for Reverse Engineering Assessments and Methods) [2] and a yeast (Saccharomyces cerevisiae) cell cycle study [24]. All datasets used were discretized using an algorithm called Bikmeans [16].

3.1 DREAM4 Data Results

The DREAM4 in silico challenge data consist of time series generated from networks of 10 and 100 genes. There are 5 networks of size 10, and for each of them there are 5 datasets of 21 time points. There are also 5 networks of size 100, each with 10 datasets of 21 time points. This experiment validates the methodology without the use of prior biological knowledge, since it consists of synthetic data.

In Fig. 2 we compare the standard Bayesian Network (BN), the Bayesian Network considering the time factor (BN with time), and the Dynamic Bayesian Network (DBN) using the BIC score. The BN with time simply means that, when estimating the conditional probabilities for a gene \(X_i[t]\), the values of its parents in the samples are taken at time \(t+1\). Networks 1 to 5 are composed of 10 genes and networks 6 to 10 of 100 genes. We can observe that DBN performs better in most cases.

Fig. 2. Comparison between the standard Bayesian Network and the Dynamic Bayesian Network, using the BIC score.

In Fig. 3 we performed the same comparison, but using the BDe score. In some cases DBN presented better results, but in others the BN with time was better.

Fig. 3. Comparison between the standard Bayesian Network and the Dynamic Bayesian Network, using the BDe score.

We also compared the true positive rate between the BIC and BDe scores. In Fig. 4 we can observe that the BDe score performs slightly better in most experiments. In experiments 1 to 10 we used BN with time (1 to 5 are the networks of 10 genes and 6 to 10 the networks of 100 genes); in experiments 11 to 20 we used DBN (11 to 15 are networks of size 10 and 16 to 20 of size 100).

Fig. 4. Comparing the true positive rates obtained when using the BIC and BDe scores for the DREAM4 data.

3.2 Yeast Cell Cycle Data Results

We tested the methodology using a dataset of yeast cell cycle gene expression provided by [24]. This dataset was obtained from six experiments, named cln3 (2 time points), clb2 (2), alpha (18), cdc15 (24), cdc28 (17) and elutriation (14). In this experiment, we selected 11 genes considered to be key regulators of the cell cycle process according to [15]. The gold standard network was obtained from the KEGG database.

In Fig. 5 we show the similarity values when using the BIC score and the prior biological knowledge for \(\omega =1,2,4\). The completeData bars were generated using all six experiments together. Because the cln3 and clb2 experiments contain only 2 samples each, we did not use them individually. We can observe that the biological knowledge improves the quality of the output network, especially in the cdc28 and alpha experiments.

Fig. 5. Similarity of the yeast cell cycle network using the BIC score and considering biological knowledge with \(\omega = 1,2,4\).

In Fig. 6 we show the results when using the same data, but with the BDe score. We can observe that the similarity values are slightly worse when compared to the BIC score experiment. It is also possible to notice a slight improvement in quality when considering the biological knowledge.

Fig. 6. Similarity of the yeast cell cycle network using the BDe score and considering biological knowledge with \(\omega = 1,2,4\).

Fig. 7. Comparing the true positive rates obtained when using the BIC and BDe scores for the yeast cell cycle data.

Comparing the true positive rates between the BIC and BDe scores shows that BDe performs better in most of our experiments, as we can observe in Fig. 7. In experiments 1 to 4 we used the completeData with biological knowledge and \(\omega =0,1,2,4\), respectively. In experiments 5 to 8 the alpha dataset was used, following the same pattern of \(\omega \). Experiments 9 to 12 correspond to the cdc15 dataset, 13 to 16 to the cdc28 dataset, and 17 to 20 to the elutriation dataset.

4 Conclusion

In this work, we proposed a methodology for the reverse engineering of gene regulatory networks from time-series gene expression data using Dynamic Bayesian Networks. We also proposed the use of a source of prior biological knowledge, the STRING database, to improve the quality of the inferred networks. The methodology was tested on two datasets: an artificial dataset provided by the DREAM4 in silico challenge and a biological dataset from a yeast cell cycle study.

To learn the structure of the DBN, we search for the network that best matches the data, based on a score function. We used two score functions, BIC and BDe, and compared them. The results show that the BDe score performs better in most of the experiments.

We also compared the standard Bayesian Network model against the Dynamic Bayesian Network, showing that the DBN provides better results than its standard version. One important point to highlight is that our methodology considers a source of prior biological knowledge, when available, unlike most reverse engineering algorithms. We showed through our experiments that this approach can improve the quality of the inferred networks. As future work, we plan to explore the use of multiple sources of biological knowledge instead of only one.