1 Introduction

Rapid progress in the development of next-generation sequencing technologies for genomics has provided valuable insights into complex biological systems [12]. As a result, modelling complex molecular regulatory networks, at the level of single cells or genes, has become an increasingly important question for bioinformatics. The goal of systems biology is to intervene on the state of the cell, using the dynamics of the underlying regulatory network. A model that accurately represents such dynamics can be used for analysis, including control [14, 19, 26, 27, 36], steady-state distribution [8, 18, 24, 31] and observability [28, 37, 38]. Such analyses aid the development of genetic therapies [11].

Boolean Networks (BNs) were introduced for this purpose by Kauffman [15]. In brief, a BN comprises a set of Boolean variables, each variable representing the on/off state of a gene, while interactions between genes are expressed by Boolean functions. It was found that even randomly generated BNs exhibit behaviour reminiscent of gene regulatory networks, with naturally arising attractor states which represent cell types or the phenotype [6, 35]. This explains the popularity of BNs for modelling gene interactions [2, 10].

However, with few exceptions, gene expression data suggests a number of possible successor states to any given state in a BN, thereby refuting the determinism inherent in BNs. Thus, a probabilistic BN (PBN) was introduced by Shmulevich et al. [30] in which the definition of a BN was adapted such that for each gene, at each time point, a Boolean function (and predictor gene set) is chosen with some conditional probability [29].

Inferring the PBN representation of a gene regulatory network (GRN) is quite involved. First, the directed graph expressing interactions between genes needs to be constructed; then, the Boolean functions need to be determined; finally, the number of candidate functions for each gene and the probabilities of selecting each of them need to be determined. Existing work (cf. Sect. 2) tends to focus on inference from time-series gene expression data as the temporal aspect reveals the transition structure of the corresponding PBN. However, as already pointed out in [4], there are concerns over the number of (typically expensive to obtain) observations needed in such gene microarray data. Approaches based on ODEs (e.g., [21]) require many observations to tune the large number of parameters of the model, while in practice only a handful are available. More such observations are available when the underlying gene network is at a steady state [31]; e.g., see the gene expression profiles of melanoma by Bittner et al. [5].

In this paper, we propose a systematic method for inferring PBNs directly from real gene expression data measurements, collected using microarray technology, when the system is at a steady state. The steady-state (long-run) behaviour of a PBN is of interest to systems biology as it allows one to determine the long-term influence of one gene on another, or the long-term joint probabilistic behaviour of a few selected genes [31].

The key contribution of our paper is a reproducible pipeline for going from gene (steady-state) data samples to the PBN representation of the long-run behaviour of the underlying genetic network. We use a predictor gene set rather than temporal data to infer the "transition structure". Unlike other proposals, our method does not require the construction of the probability transition matrix, whose size grows exponentially in the number of nodes and hence becomes computationally intractable for larger networks [1].

The remainder of the paper is structured as follows. Section 2 outlines related work. Preliminary background knowledge is presented in Sect. 3. The main algorithm for our inference method is in Sect. 4. PBNs are produced in Sect. 6 using the process described in Sect. 5. Concluding remarks are in Sect. 7.

2 Related Work

There have been various methods for PBN inference, focusing on causality, using different types of gene data [13]. Previous work on PBN inference from time-series gene data includes [32], SCODE [21] with ODEs, and most recently the Stochastic Conjunctive Normal Form (SCNF)-based method by Apostolopoulou et al. [3], which can address larger networks.

Previous work on inference from steady-state data samples is relatively limited and goes back to Shmulevich et al. [31]. A tool for computing the steady-state distribution (ssd) probabilities has been proposed in [23]. Melkman et al. [22] infer threshold PBNs, a particular version of PBNs where every input threshold function of a node must have the same number of parameters and also satisfy certain stringent conditions. Kobayashi et al. [18] construct PBNs from BNs by casting inference as an integer linear programming problem and construct a PBN that fits the given steady-state distribution.

Kim et al. [17] use steady-state gene data samples from the study on metastatic melanoma by Bittner et al. [5] (we use the same data here). They choose the genes for their PBN using a combination of Coefficient of Determination (COD) analysis and biological background knowledge (we do not assume any prior knowledge). For the functions, they ternarise their data and construct lookup tables in place of the functions for each gene. They also analyse the resulting PBNs through their steady-state distribution (ssd).

Shmulevich et al. [30], who introduced PBNs, describe a method for determining functions for nodes in a PBN. This requires finding sets of input genes which have high COD with the target gene, and using the predictive model used for the calculation of the COD as the function for the particular set of input genes. The probability for choosing the particular input gene set is proportional to the COD of the input gene set.

Discretisation of gene data is an important factor for inference. Chen et al. [7] describe a method for quantising gene data using the expressions of housekeeping genes within the dataset. Housekeeping genes are genes which keep a constant expression, as they perform important functions within the cell. Since they have a constant expression, they can be used to estimate the probability distribution function (PDF) of the gene expressions within a microarray. The constructed PDF can then be used in a hypothesis test to determine whether or not a gene is over- or under-expressed. However, this method hinges on knowing which of the genes are housekeeping genes, and this is typically not readily available.

As discussed in the introductory section, we focus on constructing PBNs from real, microarray gene data samples, collected while the system is in a steady-state, instead of simulated, time-series data or starting from BNs. We present a reproducible method to perform such a task.

3 Preliminaries

3.1 Boolean Networks

A BN [15] is a directed graph, \(G = \{V,E\}\), comprising vertices V and edges E. The vertices \(v \in V\) represent the Boolean variables, which in this case represent genes in a gene regulatory network. A directed edge \(e_{i,j} = \{v_i, v_j\} \in E\) represents that one variable, \(v_i\), influences another, \(v_j\). Each vertex \(v_i\) is associated with a Boolean function \(f_i\) given by \(f_i: \{0,1\}^{n_{in}} \mapsto \{0,1\}\). The input for \(f_i\) is a Boolean vector of length \(n_{in}\), which represents the states of all of the input vertices, and the output is a single Boolean value, which is then used as the next state of the variable \(v_i\). For a vertex \(v_i\), the input vertices are the vertices from which its incoming edges originate, given by \(\{v_j \mid e_{j,i} = \{v_j, v_i\} \in E\}\).
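To make the definition concrete, the following minimal sketch encodes a hypothetical three-gene BN and one update of its state. The network, the names `inputs`, `functions` and `bn_step`, and the choice of updating all vertices synchronously are assumptions made for illustration only; they are not taken from the published implementation.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, ...]

# Hypothetical 3-gene network: inputs[i] lists the input vertices of v_i,
# functions[i] maps the states of those inputs to the next state of v_i.
inputs: Dict[int, List[int]] = {0: [1, 2], 1: [0], 2: [0, 1]}
functions: Dict[int, Callable[..., int]] = {
    0: lambda a, b: a & b,  # v_0 <- v_1 AND v_2
    1: lambda a: 1 - a,     # v_1 <- NOT v_0
    2: lambda a, b: a | b,  # v_2 <- v_0 OR v_1
}


def bn_step(state: State) -> State:
    """Update every vertex at once from the current state (assumed synchronous)."""
    return tuple(
        functions[i](*(state[j] for j in inputs[i]))
        for i in range(len(state))
    )
```

Repeatedly applying `bn_step` to any starting state traces a deterministic trajectory, which is the behaviour that PBNs relax.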

3.2 Probabilistic Boolean Networks

Probabilistic Boolean networks are an extension of Boolean networks. They are directed graphs G, as in Boolean networks, except each function \(f_i\) for each node i in the case of Boolean networks is replaced by a set of Boolean functions \(F_i = \{f_i^1, f_i^2, \dots , f_i^{l_i}\}\), and probabilities \(c_i = \{c_i^1, c_i^2, \dots , c_i^{l_i}\}\). Hence, the logical function \(f_i\) has \(l_i\) possibilities, each with a corresponding conditional probability of being selected at every time step.

More formally, during run time, a function \(f_i^j\) for the node \(v_i\) is chosen with probability \(c_i^j\), \(j \in [1,l_i]\). PBNs are an extension to BNs in the sense that if each node within a PBN has a single function, it becomes identical to the BN.
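The selection rule can be sketched as follows. This is an illustrative example under stated assumptions: each candidate function receives the whole network state, all nodes are updated in the same step, and the two-gene network `F`, `c` is hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]
BoolFn = Callable[[State], int]


def pbn_step(
    state: State,
    F: Sequence[Sequence[BoolFn]],
    c: Sequence[Sequence[float]],
    rng: random.Random,
) -> State:
    """For each node i, draw f_i^j from F[i] with probability c[i][j] and apply it."""
    next_state: List[int] = []
    for F_i, c_i in zip(F, c):
        f = rng.choices(F_i, weights=c_i, k=1)[0]
        next_state.append(f(state))
    return tuple(next_state)


# Hypothetical two-gene PBN: node 0 has two candidate functions, node 1 has one.
F = [
    [lambda s: s[1], lambda s: 1 - s[1]],
    [lambda s: s[0] & s[1]],
]
c = [[0.7, 0.3], [1.0]]
```

With a single function per node (every `c[i]` equal to `[1.0]`), `pbn_step` behaves exactly like a BN update, mirroring the remark above.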

3.3 State Transition Graphs

For each PBN there exists a state transition graph (STG). An STG is a directed graph \(G = \{V,E\}\), where the vertices \(v_i \in V\) represent the possible states of the PBN, and the edges \(\{v_i, v_j\} = e_{i,j} \in E\) represent the possibility of a transition from state \(v_i\) to \(v_j\). Since the probability of getting to another state \(v_j\) only depends on the current state \(v_i\), we can say that the STG is a Markov chain.

By saying that the PBN has a steady-state distribution (SSD), we mean that the STG of the PBN has a steady-state distribution. For an STG to have an SSD, it needs to be ergodic, that is, every state can be reached from every other state. To guarantee that the STG is ergodic, random perturbations with low probability are introduced to the PBN.
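One common way to realise such perturbations is sketched below. The text does not fix the perturbation model, so the following is an assumption: each gene flips independently with a small probability `p`, and if at least one gene flips, the flipped state replaces the regular network update. The names `perturbed_step` and `step` are hypothetical.

```python
import random
from typing import Callable, Tuple

State = Tuple[int, ...]


def perturbed_step(
    state: State,
    step: Callable[[State], State],
    p: float,
    rng: random.Random,
) -> State:
    """Apply a random perturbation if any gene flips, otherwise the regular update."""
    flips = [rng.random() < p for _ in state]
    if any(flips):
        # Flip exactly the genes selected by the perturbation.
        return tuple(s ^ int(f) for s, f in zip(state, flips))
    return step(state)
```

Because any state can be turned into any other by a suitable combination of flips, every state remains reachable from every other state whenever `p > 0`.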

3.4 Microarray Gene Data Samples

The data used to infer a PBN in our work was taken from the study of metastatic melanoma found in Bittner et al. [5], which has been extensively studied in the literature [17, 25, 27, 33]. The study extracts and analyses the gene expression profiles of 31 melanoma cells using microarray technology. To make sure that the gene expression levels used in inferring the corresponding PBN are those of genes when the network is at a steady state, the Kolmogorov-Smirnov (KS) statistic is applied, as discussed in more detail in Sect. 5.

For a particular gene in DNA to be utilised (see [7]), assuming the cell is at a steady state, the relevant segment of the molecule must first be transcribed, producing messenger RNA (mRNA) which is accessible to the rest of the proteins. The quantity of mRNA in a cell signifies the degree of protein production associated with a particular gene.

DNA microarrays measure the presence of mRNA within a cell. The microarrays consist of a surface with an array of robotically placed complementary DNA for the genes to be analysed. mRNA tightly bonds with complementary DNA, hence the microarray can be used to isolate different mRNA molecules. The process is known as hybridisation.

The quantity of mRNA within a cell is measured by tagging the mRNA with fluorescent molecules, hybridising them with a microarray, and exciting the fluorescent molecules. The emitted brightness is proportional to the amount of mRNA present.

Since the amount of mRNA differs depending on the gene, the data is normalised by dividing the values recorded by the values recorded from a reference probe. Since the values recorded are non-negative, the ratio values are in the range \([0,\infty )\). Furthermore, since we would expect the values within the reference probe and the sample not to differ, the median of the ratio values is expected to be 1. These are the values provided by Bittner et al. [5] in the form of a matrix of size 8,150 (number of genes) by 31 (number of samples). A small sample of the raw data is shown later in Fig. 1(a).

For demonstrating our method of inferring a PBN, we work with the subset of melanoma genes analysed by Datta et al. [9], which are extensively studied in the literature [17, 18, 25, 27, 33], namely WNT5A, pirin, S100P, RET1, MART1, HADHB and STC2. This offers straightforward validation for our approach since it produces the same PBN.

It is worth noting that larger PBNs may be constructed following the pipeline described in this paper. We have constructed the 28-node PBN given in [33], as well as a 70-node PBN which includes the 28 nodes already studied in [33] padded with the 42 nodes with the highest weighting of importance, using discriminative weights [5], which determine how a gene changes during the experiment compared to the control cells.

3.5 Coefficient of Determination

Coefficients of Determination (CODs) were described by Kim et al. [16] as a method to determine which gene determines the state of which other gene. The COD of a target variable, Y, with regard to an input variable, X, is a measure of how well the target variable can be predicted using the input variable. A predictive model f is used to predict the value of the target variable without and with the input variable, yielding the errors \(\bar{e}\) and e respectively. The relative reduction in the error of the predictive model is the COD \(\theta \), given by Eq. 1:

$$\begin{aligned} \theta = \frac{\bar{e} - e}{\bar{e}} \end{aligned}$$
(1)
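As a small worked instance of Eq. 1, the sketch below computes a COD for a binary target. Two choices here are assumptions made for illustration and are not prescribed by the text: the error is measured as the misclassification rate, and the error without the input, \(\bar{e}\), is taken from the best constant predictor. The function names are hypothetical.

```python
from typing import Sequence


def error_rate(pred: Sequence[int], y: Sequence[int]) -> float:
    """Fraction of samples on which the prediction differs from the target."""
    return sum(p != t for p, t in zip(pred, y)) / len(y)


def baseline_error(y: Sequence[int]) -> float:
    """Error of the best constant predictor, used here as e_bar (assumption)."""
    ones = sum(y)
    return min(ones, len(y) - ones) / len(y)


def cod(e_bar: float, e: float) -> float:
    """Eq. 1: relative reduction of the error obtained by using the input."""
    if e_bar == 0:
        return 0.0  # the target is constant, so no input can improve on it
    return (e_bar - e) / e_bar
```

For example, a target with baseline error 0.375 that is predicted with error 0.125 once the input is used has a COD of (0.375 - 0.125) / 0.375, i.e. two thirds.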

There are no constraints on what can be used as a predictive model. We opted for a perceptron. This is because there exists a closed-form solution for linear regression of the perceptron, described by Kim et al. [16], which can be used instead of training. This aids in lowering the computation time.

The weights of a perceptron, A, can be computed using the closed form solution:

$$\begin{aligned} A &= R^+ \cdot C \\ R &= X \cdot X^T \\ C &= X \cdot Y \end{aligned}$$
(2)
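Eq. 2 can be evaluated directly with NumPy, as in the sketch below. Several details are assumptions, since the text does not spell them out: the columns of X are samples, a row of ones is appended to X to act as a bias, \(R^+\) is read as the Moore-Penrose pseudo-inverse of R, and the real-valued output is binarised at 0.5. The function names are hypothetical.

```python
import numpy as np


def perceptron_weights(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Closed-form weights A = R^+ C with R = X X^T and C = X Y (Eq. 2).

    X has shape (d, n): one row per input (plus an assumed bias row of ones),
    one column per sample. Y has shape (n,).
    """
    R = X @ X.T
    C = X @ Y
    return np.linalg.pinv(R) @ C


def predict(A: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Binarise the linear output at 0.5 (assumed threshold)."""
    return (A @ X >= 0.5).astype(int)


# Toy data: two inputs plus a bias row; the target simply copies the first input.
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 1]], dtype=float)
Y = np.array([0, 0, 1, 1], dtype=float)
A = perceptron_weights(X, Y)
```

In this toy case the weights recover the target exactly, placing all of the weight on the first input.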

3.6 Discretisation

Since PBNs use discrete values, the gene data which consists of real values has to be discretised. Discretisation is a process where values get mapped from the real value domain to the integer domain. For the problem at hand, since genes can be in one of two states, the range of the function should be either 0 or 1. Hence the function should take the form of:

$$\begin{aligned} f: G \rightarrow G_d, x \ge 0, \forall x \in G, y \in \{0, 1\} \forall y \in G_d \end{aligned}$$
(3)

Such a method is described in detail in [34]. It consists of deciding upon a threshold value t with which all real values are compared. Each value then gets mapped to 0 if it is below the threshold, and to 1 otherwise, as given by Eq. 4.

$$\begin{aligned} G_d[x,y] = \begin{cases} 0, & \text{if } G[x,y] < t\\ 1, & \text{if } G[x,y] \ge t \end{cases} \end{aligned}$$
(4)
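Eq. 4 translates into a few lines of code. The sketch below assumes that the threshold t is computed separately for each gene (each row of G) and that the threshold function is supplied by the caller; the name `discretise` is hypothetical.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def discretise(
    G: Sequence[Sequence[float]],
    threshold_fn: Callable[[Sequence[float]], float] = median,
) -> List[List[int]]:
    """Map each entry to 0 if it is below its gene's threshold, and to 1 otherwise."""
    G_d = []
    for row in G:  # one row per gene
        t = threshold_fn(row)
        G_d.append([0 if x < t else 1 for x in row])
    return G_d


# Two hypothetical genes measured over four samples.
G = [[0.5, 1.0, 2.0, 4.0],
     [0.2, 0.4, 0.6, 5.0]]
```

On this toy matrix the second gene is discretised differently under the two thresholds: its median is 0.5 but its mean is 1.55, so the mean produces more zeros.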

The threshold may be any metric. Common metrics are means or medians. The threshold may also be the boundary between the top \(x\%\) of entries and the rest.

Shmulevich et al. [29] describe a process of using k-means clustering to cluster the data, and assigning values to the data points depending on the cluster they belong to. However, since half of the data points lie in the range (0, 1), and the other half is in the range \((1, \infty )\), the lower cluster ends up larger, resulting in a larger threshold that produces more zeros. This can be remedied by performing k-means clustering on the logarithms of the data points. This makes the ranges of both halves the same, producing more representative clusters.
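The log-domain variant can be sketched with a one-dimensional, two-cluster k-means. This is an illustrative version under assumptions: all values are strictly positive (so that the logarithm is defined), the two centres are initialised at the smallest and largest log-values, and the threshold is the midpoint between the final centres mapped back to the original scale. The function name is hypothetical.

```python
import math
from typing import Sequence


def kmeans_log_threshold(values: Sequence[float], iters: int = 100) -> float:
    """Two-cluster 1-D k-means on log-values; returns a threshold on the raw scale."""
    logs = [math.log(v) for v in values]  # requires v > 0
    lo, hi = min(logs), max(logs)         # initial cluster centres
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [x for x in logs if x < mid]
        high = [x for x in logs if x >= mid]
        if not low or not high:
            break  # all points fell into one cluster
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return math.exp((lo + hi) / 2)
```

For the symmetric ratios 0.25, 0.5, 2 and 4, the resulting threshold is approximately 1, which matches the expected median of the normalised data.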

4 Inference of PBNs

In this section we describe the inference method and how it can be implemented. Our approach to inferring a PBN starts with the real gene expression data in the form of a matrix G as input (see Fig. 1(a)), and produces a PBN (see Fig. 1(b)). The input matrix is of size \(m \times n\), where m is the number of genes and n is the number of samples.

The method we apply for inferring PBNs draws upon work done by Shmulevich et al. [30]. First, it requires the dataset to be discretised (recall Sect. 3.6). This process is performed in Algorithm 1.

Algorithm 1.

Given a target gene, the \(n_p\) sets of genes with the highest CODs are found. This is done following Algorithm 2.

Algorithm 2.

A buffer of size \(n_p\) is initialised, and the COD of each possible combination of input genes is calculated. If a combination of inputs has a COD higher than at least one saved in the buffer, the buffer entry with the lowest COD is replaced by the new combination of inputs. This results in a buffer full of the input combinations with the highest CODs. One such buffer is initialised per target gene, resulting in \(n_p\) input combinations per target gene.
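The buffer logic described above can be sketched with a min-heap, so that the entry with the lowest COD is always the one replaced. This is a minimal illustration rather than the published implementation: `cod_fn` is a hypothetical callable that returns the COD of a target for a given input combination, all combinations have a fixed size `k`, and excluding the target gene from its own candidate inputs is an assumption.

```python
import heapq
from itertools import combinations
from typing import Callable, Iterable, List, Tuple

Combo = Tuple[int, ...]


def top_predictor_sets(
    target: int,
    genes: Iterable[int],
    k: int,
    n_p: int,
    cod_fn: Callable[[int, Combo], float],
) -> List[Tuple[float, Combo]]:
    """Return the n_p input combinations of size k with the highest CODs."""
    buffer: List[Tuple[float, Combo]] = []  # min-heap ordered by COD
    candidates = [g for g in genes if g != target]  # assumed: no self-input
    for combo in combinations(candidates, k):
        score = cod_fn(target, combo)
        if len(buffer) < n_p:
            heapq.heappush(buffer, (score, combo))
        elif score > buffer[0][0]:
            # Replace the buffer entry with the lowest COD.
            heapq.heapreplace(buffer, (score, combo))
    return sorted(buffer, reverse=True)
```

Note that the number of combinations grows quickly with the number of genes and with `k`, so the exhaustive loop is only practical for modest values of both.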

During run-time, a set of input genes is chosen with probability proportional to the COD of the set, and the next state is governed by the state of those input genes in conjunction with the predictive model that was saved. For all intents and purposes, the lists of input genes, perceptron weights and probabilities are enough to construct a PBN, as the input genes convey the connectivity, and the perceptron weights convey the logic for that set of input genes. The process is summarised in Algorithm 3.
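The run-time behaviour of a single node can be sketched as follows. The data layout (a list of dictionaries with keys `inputs`, `weights` and `cod`), the trailing bias weight, the 0.5 output threshold and the uniform fallback when all CODs are zero are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence, Tuple

State = Tuple[int, ...]


def selection_probabilities(cods: Sequence[float]) -> List[float]:
    """Normalise CODs so that each set is chosen proportionally to its COD."""
    total = sum(cods)
    if total <= 0:
        return [1 / len(cods)] * len(cods)  # assumed fallback: uniform choice
    return [c / total for c in cods]


def next_value(state: State, predictors: List[Dict], rng: random.Random) -> int:
    """Pick one predictor set for a node and evaluate its saved perceptron."""
    probs = selection_probabilities([p["cod"] for p in predictors])
    chosen = rng.choices(predictors, weights=probs, k=1)[0]
    x = [state[j] for j in chosen["inputs"]] + [1]  # trailing 1 is the bias input
    activation = sum(w * v for w, v in zip(chosen["weights"], x))
    return 1 if activation >= 0.5 else 0
```

Calling `next_value` once per node, each with its own predictor list, yields the next state of the whole PBN.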

Algorithm 3.

Fig. 1. Input and output for the inference method

5 Analysis

The analysis of the PBNs generated by our approach is based on the steady-state distribution, which is fairly standard, e.g., see [17]. The PBN is run for T steps in order to bring it to a steady state. It is then run for a further N steps, recording the state at each step. To confirm whether or not the PBN is in a steady state after T steps, the Kolmogorov-Smirnov (KS) statistic is calculated for the two halves of N.

The N recorded entries are split into two halves, one containing states \([0, \frac{N}{2}]\), the other containing \([\frac{N}{2}+1, N]\). The entries are subsampled with the interval G. Histograms of the two halves are converted to cumulative distribution functions (CDFs), and the maximum vertical distance between them is found, which is the KS statistic.
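The computation of the statistic can be sketched as below, treating each recorded state as an integer. This is an illustration using only the standard library; the function names are hypothetical, and the exact indexing of the two halves is an assumption.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[int], b: Sequence[int]) -> float:
    """Maximum vertical distance between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def half_split_ks(trajectory: Sequence[int], interval: int) -> float:
    """Split the recorded states into two halves, subsample each, and compare."""
    half = len(trajectory) // 2
    first = trajectory[:half:interval]
    second = trajectory[half::interval]
    return ks_statistic(first, second)
```

A value close to 0 indicates that both halves visit states with similar frequencies, which is what is expected once the PBN has reached its steady state.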

The significance test shows the probability of the two CDFs being drawn from the same distribution. If the PBN had not reached a steady state after T steps, the halves of N would be drawn from different distributions, which would be indicated by the KS test. The recorded states are strings of binary values. Therefore, for ease of analysis, they are treated as Gray-coded integers and displayed on a histogram (cf. Fig. 2). Neighbouring positions on the histogram then correspond to network states that differ in exactly one gene, so that horizontal proximity reflects a small Hamming distance between network states.
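A minimal sketch of the Gray-code interpretation follows; it reads a state as a Gray codeword, most significant bit first, and returns the integer position used on the x-axis. The bit order and the function name are assumptions.

```python
from typing import Sequence


def gray_to_int(bits: Sequence[int]) -> int:
    """Decode a Gray-coded bit string (most significant bit first) to an integer."""
    value = 0
    acc = 0
    for b in bits:
        acc ^= b  # each binary bit is the XOR of all Gray bits seen so far
        value = (value << 1) | acc
    return value
```

With three genes, the states 000, 001, 011, 010, 110, 111, 101 and 100 are mapped to positions 0 to 7 in that order, so each state differs from its neighbour on the axis in exactly one gene.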

6 Evaluation

We have implemented the pipeline in Python 3 and made it publicly available at https://github.com/UoS-PLCCN/pbn-inference.

We have constructed PBNs of size 7 from data produced by Bittner et al. [5] using different thresholds for the quantisation methods. The thresholds were (a) average of a gene expression; (b) median of a gene expression, and (c) k-means clustering of a gene expression. The data was quantised on a per-gene basis, with each gene having 10 triplets of input genes.

For the construction and validation of the histograms representing the steady-state distribution, we have chosen the parameters to be \(T = 10^6\), \(N=4 \cdot 10^6\), \(G=10\) and \(R = 100\). On a laptop with 32 GB of RAM and an Intel® Core™ i7-7700HQ Processor, each histogram took around 9 hours to produce. The results are shown in Fig. 2.

Fig. 2. SSDs of PBNs generated using different quantisation methods. States on the x-axis; SSD probability on the y-axis

It can be seen that the average and the median quantisation methods produce very similar histograms, with three peaks each, and the latter two peaks being in similar positions. The histogram generated using the PBN constructed from k-means clustering has only one prominent state, which can also be observed in the other two PBNs. It is instructive to note that the few very prominent states in the histograms shown in Fig. 2 agree with the assumption made by Kim et al. [17] that gene regulatory networks found in nature only occupy a small fraction of the possible state space.

For the purposes of direct comparison, we have trialled the proposed method in the DREAM (Dialogue on Reverse Engineering Assessment and Methods) challenge, which offers a benchmark for network inference (DREAM 3) [20], and scored 8th (out of 29).

7 Conclusion

In this work we described the inference of a PBN directly from real gene data, collected using microarray technology, taken when the system was at a steady state. This kind of gene profiling is typically less costly to obtain than time-series data, and includes more data points. Using the evaluation methods described in the literature, e.g., by Kim et al. [17], we have concluded that the pipeline works well for the examples provided. However, it is subject to fine-tuning the parameters. We have provided the method as a systematic pipeline which can be reproduced, and have made it publicly available on GitHub at https://github.com/UoS-PLCCN/pbn-inference.

We note that the method scored 8th (out of 29) in the DREAM challenge and has been used to infer large PBNs (N = 200).

It is worth noting that the proposed method does not require a state transition probability matrix to be produced. The matrix can be extracted from the PBN; however, the time required grows exponentially with the size of the PBN. This means that conventional mathematical methods in the literature that make use of the transition probability matrix may not always be applicable.

One concern is that the transitions get fitted to the quantised dataset. It is widely accepted that the states observed in the dataset are steady states of the cells. Since the transition rules get fitted to the steady states of the cells, the resulting PBN will be driven towards the steady states observed within the data. However, while it is certain that the method captures the long-run behaviour (steady-state) of the underlying gene regulatory network, there is little certainty that the PBN will behave with biological accuracy between the observed steady states. This concern could possibly be addressed by using time-series gene data to augment the method presented here, as this type of data captures the change of gene expression levels with respect to time. This promises to capture the behaviour at and between steady states, without reconstruction of the state evolution of the PBN, and is certainly worth exploring further in future work.