1 Introduction

Most real-world systems can be represented by a graph with a nontrivial topology, called a complex network, in which the nodes represent real-world entities (or variables) and the arcs represent the interactions between them [15]. The goal is to better understand the functioning of the real system by studying the interactions between the entities involved and the properties of the network [12]. To build the complex network associated with a real-world system, we need to carefully reverse engineer it, that is, to reconstruct the network representing the real system by measuring only the outputs of the system over a period of time [22].

Unlike forward problems, where the effects can be accurately predicted by the system model given one or more causes as inputs, inverse problems proceed in the opposite direction [20]. Here, the mathematical formulation of the problem is not known, so nothing can be asserted about it directly [5]; the only available information comes from observations of the system. Explaining the actual causes of a particular phenomenon, however, can be extremely difficult, if not impossible, because small differences in effects can lead to large differences in causes, or the same effect can result from more than one cause [21]. In practice, estimating the parameters of the model is a difficult and time-consuming process, due also to the high-dimensional parameter space, although some efficient algorithms have been proposed [6, 7]. Several approaches to such inverse problems can be found in the literature. For example, in [9] different mathematical approaches to this problem are described. However, as stated in [4], mathematical approaches may not be sufficient to find a solution in reasonable time with high accuracy. Therefore, in recent years, many studies have focused on the application of deep learning techniques to inverse problems, e.g., in image processing [14], astronomy [8], physics [16], biology [1], civil engineering [3], and so on.

In this paper, we present a novel approach based on machine learning techniques that pursues two goals: creating an artificial environment capable of replicating the behaviour of a real environment based solely on observations of the variables of interest, and generating a complex network that reveals the relationships among the variables of the system.

One of the areas of computer science where the application of reverse engineering is on the rise is the field of genetics [17]. Microarray technology allows researchers to examine the presence of multiple genes in a DNA sample and their levels of expression [11]. Using our method, we were able first to reproduce this behaviour artificially and then to create a gene regulatory network showing the interactions between genes.

In Sect. 2 we discuss how to model and create an artificial environment that can replicate the real environment. In Sect. 3 we describe a methodology that uses the artificial environment to facilitate the creation of a gene regulatory network. In Sect. 4 we present the results obtained.

2 Modeling

To solve the inverse problem, we need to find the best model m such that

$$\begin{aligned} d=G(m) \end{aligned}$$
(1)

where \(G(\cdot )\) is an operator describing the explicit relationship between the observed data d and the model parameters [2]. In this section, we present a novel approach based on machine learning techniques to determine the best model m that can replicate the behaviour of a real environment when only the observations are available.

2.1 Environment

An environment consists of several elements (such as entities or variables) that interact with each other according to well-defined rules established during the design process. Its two main components are the agents and the rules that govern their interactions. Each agent is responsible for predicting the future value of one variable based on the current state of the environment, while the interaction rules define the mechanism by which the state of the environment evolves. Figure 1 depicts the environment scheme under consideration.

Fig. 1. Environment architecture.

Given an environment with \(k\) variables, the state of the environment at time t, denoted by \(s^{(t)}\), is defined as a \(k\)-vector in which the generic element \(s^{(t)}_i\) is the value of the i-th variable at time t. The subsequent state of the environment \(s^{(t+1)}\) is computed as follows:

$$\begin{aligned} s^{(t+1)}=f\left( s^{(t)}\right) \end{aligned}$$
(2)

As can be seen from Fig. 1, each agent in the environment is responsible for predicting the value of the variable to which it refers. Accordingly, there are as many agents as there are variables in the environment, so Eq. 2 can also be written as follows.

$$\begin{aligned} s^{(t+1)}=f\left( s^{(t)}\right) = \left[ f_1\left( s^{(t)}\right) , f_2\left( s^{(t)}\right) ,\dots , f_k\left( s^{(t)}\right) \right] \end{aligned}$$
(3)

where \(f_i\) denotes the agent function and represents the i-th component of the function \(f\).
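
To make the architecture concrete, the state-update rule of Eqs. 2 and 3 can be sketched in Python as follows. This is a minimal illustration only; the `Environment` class, its method names, and the use of NumPy are our assumptions, not part of the original design.

```python
import numpy as np

class Environment:
    """Minimal environment sketch: one agent (a callable) per variable, as in Eq. 3."""

    def __init__(self, agents):
        # agents[i] maps the full state vector s^(t) to the scalar s^(t+1)_i
        self.agents = agents

    def step(self, state):
        # Apply every agent to the same current state (Eq. 2 / Eq. 3)
        return np.array([f_i(state) for f_i in self.agents])

    def rollout(self, initial_state, horizon):
        # Let the environment evolve freely for `horizon` time steps
        states = [np.asarray(initial_state, dtype=float)]
        for _ in range(horizon):
            states.append(self.step(states[-1]))
        return np.stack(states)
```

A single `rollout` call corresponds to the kind of free-running simulation discussed later in Sect. 3 (Fig. 4).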

2.2 Agent

An agent can be formally described as a function \(f_i: S^k\rightarrow S\) that predicts the i-th component of the state of the environment at time \(t+1\), given its previous state. We use two different types of prediction scheme: simple agent prediction and delta agent prediction.

Fig. 2. Agent prediction schemes: (a) simple agent prediction; (b) delta agent prediction.

Simple Agent Prediction In the first prediction scheme (Fig. 2a), an agent directly estimates the future value of the variable according to Eq. 4.

$$\begin{aligned} s^{(t+1)}_i = f_i\left( s^{(t)}_1,s^{(t)}_2,\dots ,s^{(t)}_k\right) \end{aligned}$$
(4)

Delta Agent Prediction On the other hand, Fig. 2b illustrates the second agent prediction scheme. In this case, an agent predicts by how much the current value should be increased or decreased; in other words, it predicts the offset between the value of the i-th component of the state at time \(t+1\) and its value at time t. The output of the agent is calculated as follows.

$$\begin{aligned} s^{(t+1)}_i = s^{(t)}_i + \varDelta s^{(t+1)}_i = s^{(t)}_i + f_i\left( s^{(t)}_1,s^{(t)}_2,\dots ,s^{(t)}_k\right) \end{aligned}$$
(5)
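
A minimal sketch of the two schemes, assuming each agent wraps a fitted scikit-learn-style model exposing `predict` (the class names and interfaces are illustrative assumptions):

```python
import numpy as np

class SimpleAgent:
    """Simple agent prediction (Eq. 4): the predictor outputs s^(t+1)_i directly."""

    def __init__(self, index, predictor):
        self.index = index          # i, the variable this agent is responsible for
        self.predictor = predictor  # any fitted model exposing .predict()

    def __call__(self, state):
        return float(self.predictor.predict(np.atleast_2d(state))[0])


class DeltaAgent:
    """Delta agent prediction (Eq. 5): the predictor outputs the offset, added to s^(t)_i."""

    def __init__(self, index, predictor):
        self.index = index
        self.predictor = predictor

    def __call__(self, state):
        delta = float(self.predictor.predict(np.atleast_2d(state))[0])
        return state[self.index] + delta
```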

Each agent in the environment can be considered as a black-box function with its own architecture and configuration. The basic model for each agent is shown in Fig. 3.

Fig. 3. The basic model of an agent.

The Predictor is the core of the agent and is responsible for forecasting the value of a variable given the current state of the environment as input. A predictor can be anything from a neural network to a regression model to a decision tree. Different types of predictors were tested to explore different possible configurations of the artificial environment. In Table 1 we report the available configurations for each predictor.

Table 1. Agent configurations for each type of predictor: 1) Predictor specifies the type of predictor used to predict the variable; 2) Machine learning task defines whether the predictor is a classifier, returning a class value, or a regressor, predicting a real value; 3) Agent prediction scheme defines how the agent’s output is composed; 4) Training parameters are the parameters used to train the predictor.

Since an environment may contain different types of agents, it is important to use additional layers to ensure agent interoperability. Therefore, two additional blocks are added before and after the predictor: an encoder and a decoder. The encoder’s role is to convert the agent’s input into a format that the predictor can understand. The decoder’s role, on the other hand, is to convert the predictor’s output into a format that is compatible with the architecture of the environment (\(s_i^{(t+1)}\)).
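
As a sketch of this pipeline, the encoder and decoder below are trivial identity-like callables and only illustrate where format conversion would happen; nothing in the paper prescribes this particular interface.

```python
import numpy as np

class Agent:
    """Encoder -> predictor -> decoder pipeline of Fig. 3 (illustrative interface)."""

    def __init__(self, predictor, encoder=None, decoder=None):
        self.encoder = encoder or (lambda s: np.asarray(s, dtype=float))  # adapt input to the predictor's format
        self.predictor = predictor                                        # neural network, regressor, decision tree, ...
        self.decoder = decoder or float                                   # adapt output to the environment's format

    def __call__(self, state):
        x = self.encoder(state)
        y = self.predictor.predict(np.atleast_2d(x))[0]
        return self.decoder(y)
```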

3 Methodology

In a gene regulatory network, each gene can be activated or inhibited depending on the expression level of another gene, called its regulator or regulatory gene. To find the regulatory genes of each gene, we used a simple principle: if varying the expression level of gene \(G_i\) causes a significant change in the expression level of gene \(G_j\), then \(G_i\) might be a good candidate as a regulator of gene \(G_j\) [10]. However, this approach can only be applied if the mathematical formulation that determines the relationships between gene expression levels is known. Several methods for constructing a gene inference model have been proposed in the literature, for instance continuous models such as ordinary differential equations (ODEs), which are based on estimates of the expression level over time. Although the ODE approach provides detailed information about the dynamics of gene expression, it requires high-quality data to build an accurate model [13]. In this section, we first look at how the model described above can be used to predict the expression level of genes, and then take advantage of the artificial environment to generate gene regulatory networks (GRNs).

3.1 Gene Expression Level Prediction

We create an artificial environment by using: the observations X, represented by a \(k\times n\) matrix, where \(k\) is the number of variables and \(n\) is the number of observations over time; and the environment’s configuration, which consists of a collection of k tuples, each associated with an agent and containing the elements listed in Table 1, namely the predictor, the machine learning task, the prediction scheme, and the training parameters. Algorithm 1 shows the procedure for creating an artificial environment and for evaluating the environment thus created.

Since the model of the i-th agent depends on its associated configuration, each agent must be trained on its own training set consisting of the pair \((X,Y_i)\), where \(Y_i\) is an \(n\)-vector containing the values of the i-th variable shifted forward by one time unit. The artificial environment E is composed of all the agents created according to the architecture described in Sect. 2.1.
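
A possible way to build these training sets, assuming X is stored as a \(k\times n\) NumPy array and that each predictor follows the scikit-learn fit/predict interface; the choice of `RandomForestRegressor` is ours, simply one of many admissible predictors from Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # one possible predictor; any model from Table 1 could be used

def build_training_set(X, i, delta=False):
    """Inputs are the states at times 0..n-2; targets are the i-th variable shifted by one time unit."""
    inputs = X[:, :-1].T                # shape (n-1, k): states s^(t)
    targets = X[i, 1:]                  # shape (n-1,):   values s^(t+1)_i
    if delta:
        targets = targets - X[i, :-1]   # delta prediction scheme: learn the offset instead
    return inputs, targets

def train_agents(X, delta=False):
    """Train one predictor per variable, forming the agent collection of the environment E."""
    agents = []
    for i in range(X.shape[0]):
        inputs, targets = build_training_set(X, i, delta)
        # in the full pipeline these models would be wrapped by the agent classes of Sect. 2.2
        agents.append(RandomForestRegressor(n_estimators=100).fit(inputs, targets))
    return agents
```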

Algorithm 1. Creation of the artificial environment and its evaluation (pseudo-code).

Fig. 4. Prediction of the gene expression levels of ten genes over time. The red line represents the actual observations, while the blue line shows the predicted values over time.

Ideally, an optimal artificial environment without any perturbation and with identical initial conditions should be able to produce the same response as the real environment. Figure 4 depicts the artificial environment’s response over a time interval equal to twice the observation time. As we can see, the response of the artificial environment before 1000 time units is similar to the observations in the real environment. After 1000 time units, on the other hand, the artificial environment uses its knowledge to forecast the expression level of the genes for the subsequent time points.

To validate the effectiveness of the prediction, the cosine similarity measure is used. Given two matrices \(X\in \mathbb {R}^{k\times n}\) and \(\hat{X}\in \mathbb {R}^{k\times n}\), representing the actual and the predicted data respectively, with generic element \(x_{ij}\) corresponding to the value of the i-th variable at the j-th time point, the similarity index \(Q_{\%}\) denotes how close the prediction is to the target values in percentage terms and is defined as follows:

$$\begin{aligned} Q_{\%}= \frac{1}{k} \left( \sum _{i=1}^{k} \left| \frac{\sum _{j=1}^{n} x_{ij} \hat{x}_{ij}}{\sqrt{\sum _{j=1}^{n} x_{ij}^2} \sqrt{\sum _{j=1}^{n} \hat{x}_{ij}^2}} \right| \right) \times 100. \end{aligned}$$
(6)
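
Eq. 6 translates directly into a few lines of NumPy (a straightforward sketch; the function and variable names are ours):

```python
import numpy as np

def similarity_index(X, X_hat):
    """Q% of Eq. 6: mean absolute cosine similarity between actual and predicted rows, in percent."""
    num = np.sum(X * X_hat, axis=1)                                  # row-wise dot products
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(X_hat, axis=1)  # product of row norms
    return float(np.mean(np.abs(num / den)) * 100)
```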

To choose the configuration of the agents that maximizes the similarity index, we used a local search algorithm [19, 24] that continuously improves a candidate solution until the best configuration for the environment is found.
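
The paper does not detail the local search of [19, 24]; purely as an illustration, a simple hill-climbing variant could perturb one configuration entry at a time and keep any change that improves \(Q_{\%}\):

```python
import random

def local_search(build_environment, evaluate, initial_config, candidate_values, n_iters=100):
    """Illustrative hill climbing over agent configurations (the actual search in [19, 24] may differ).

    build_environment(config) -> trained artificial environment
    evaluate(environment)     -> similarity index Q% to maximise
    candidate_values[key]     -> admissible values for each configuration entry
    """
    best_config = dict(initial_config)
    best_score = evaluate(build_environment(best_config))
    for _ in range(n_iters):
        key = random.choice(list(candidate_values))            # pick one configuration entry
        neighbour = dict(best_config)
        neighbour[key] = random.choice(candidate_values[key])  # try an alternative value
        score = evaluate(build_environment(neighbour))
        if score > best_score:                                 # keep only improving moves
            best_config, best_score = neighbour, score
    return best_config, best_score
```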

3.2 Gene Regulatory Network

Algorithm 2 shows the pseudo-code of the procedure used to determine the regulatory matrix using the artificial environment. Suppose we want to determine whether gene \(G_i\) is a regulatory gene of \(G_j\). Using the artificial environment, the expression value of \(G_i\) can be manually set to \(V_i\) at time t (lines 5–6), and it is then possible to observe how gene \(G_j\) responds to this change at time \(t+1\) (lines 7–8). Clearly, we cannot know whether or not the measurements obtained in this way are correct. For that, we would need a real dataset and would have to validate the measurements in the field using microarray technology.

We denote by “initial time” \( t-1 \) the time when the environment is not subject to perturbations and evolves naturally; by “transition time” t the time when the perturbation acts on gene \(G_i\) and its expression level is changed manually; and by “final time” \( t + 1 \) the time after the perturbation, when the artificial environment reacts to it and the expression levels of the genes are calculated taking this change into account. Following Eq. 3, Eqs. 7, 8, and 9 represent the state of the environment calculated at the initial time, at the transition time, and at the final time, respectively.

Algorithm 2. Computation of the regulatory matrix using the artificial environment (pseudo-code).

The initial time defined above is determined by the procedure in line 1 of Algorithm 2. Basically, we used a linear regression model to determine when all variables in our artificial environment have reached stability, that is, when they no longer fluctuate and remain within a bounded interval. When all variables exhibit this behaviour, the system is considered stable and the current time point corresponds to our initial time.
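
A sketch of such a stability test is shown below; the window length and slope tolerance are our assumptions, since the paper only states that a linear regression model is used.

```python
import numpy as np

def is_stable(trajectory, window=50, slope_tol=1e-3):
    """Fit a line to the last `window` values of every variable and require a near-zero slope."""
    t = np.arange(window)
    for series in trajectory[-window:].T:       # trajectory has shape (time, k)
        slope = np.polyfit(t, series, 1)[0]     # first-degree least-squares fit
        if abs(slope) > slope_tol:
            return False
    return True

def find_initial_time(trajectory, window=50, slope_tol=1e-3):
    """Return the first time index at which all variables have stabilised, or None."""
    for t_end in range(window, len(trajectory) + 1):
        if is_stable(trajectory[:t_end], window, slope_tol):
            return t_end - 1
    return None
```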

Although the state of the environment is normally calculated using Eq. 2, Eq. 8, which determines the state of the environment at the transition time, is slightly different: in this case, the i-th component of the state must be set to a constant \( V_i\).

$$\begin{aligned} s^{(t-1)}=\left[ f_1\left( s^{(t-2)}\right) , f_2\left( s^{(t-2)}\right) , \dots , f_i\left( s^{(t-2)}\right) , \dots , f_k\left( s^{(t-2)}\right) \right] , \end{aligned}$$
(7)
$$\begin{aligned} s^{(t)}=\left[ f_1\left( s^{(t-1)}\right) , f_2\left( s^{(t-1)}\right) , \dots , V_i, \dots , f_k\left( s^{(t-1)}\right) \right] , \end{aligned}$$
(8)
$$\begin{aligned} s^{(t+1)}=\left[ f_1\left( s^{(t)}\right) , f_2\left( s^{(t)}\right) , \dots , f_i\left( s^{(t)}\right) , \dots , f_k\left( s^{(t)}\right) \right] . \end{aligned}$$
(9)
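
A sketch of this perturbation step, assuming an environment object with a `step(state)` method as in the earlier environment sketch (the function and variable names are ours):

```python
import numpy as np

def perturbation_experiment(env, stable_state, i, V_i):
    """Eqs. 7-9: clamp gene i to V_i at the transition time and let the environment react for one step."""
    s_initial = np.asarray(stable_state, dtype=float)   # s^(t-1), Eq. 7 (stable, unperturbed state)
    s_transition = env.step(s_initial)                   # Eq. 8 ...
    s_transition[i] = V_i                                # ... with the i-th component set to V_i
    s_final = env.step(s_transition)                     # s^(t+1), Eq. 9
    return s_initial, s_final
```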

By comparing the expression levels at the initial and at the final time, it is possible to determine which genes have been affected by the perturbation (Fig. 5).

Fig. 5. Example of calculating the regulatory values of two genes \(G_j\) and \(G_h\) using \(G_i\) as their regulatory gene. \(\varDelta _{i,j}\) and \(\varDelta _{i,h}\) represent the variation between the expression levels of the genes at the final time and at the initial time. \(V_i\) is the value used to determine which genes are regulated by the \(i\)-th gene.

We define the regulatory value \(r_{i,j}\) as the offset between the expression level of gene \(G_j\) at the final time and at the initial time, given a perturbation of gene \(G_i\) at the transition time (line 10). The regulatory value thus obtained is normalized by the maximum value \(M_j\) observed for the \(j\)-th gene over the whole observation time.

$$\begin{aligned} r_{i,j}=\left| \frac{\varDelta _{i,j}}{M_j}\right| =\left| \frac{s^{(t+1)}_j-s^{(t-1)}_j}{M_j}\right| . \end{aligned}$$
(10)

The comparison between the expression levels of \(G_j\) at the final and at the initial time can provide important information about which genes are correlated with each other. Accordingly, \(G_i\) is a gene regulator of \(G_j\) if and only if the regulatory value \(r_{i,j}\) is larger than a threshold value (\(>0\)). A regulatory matrix \(R\in \mathbb {R}^{k\times k}\) contains all the regulatory values discovered for each gene pair, and a gene regulatory network can be extracted from it: given a threshold value, it is possible to determine which genes are the regulators of each gene. Figure 6 shows, as an example, how a regulatory matrix with only three genes is transformed into a gene regulatory network using a threshold value of 0.1.
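
Putting the pieces together, the construction of the regulatory matrix and the extraction of the GRN can be sketched as follows, reusing the `perturbation_experiment` helper above; whether self-loops are excluded and how the perturbation values V are chosen are our assumptions.

```python
import numpy as np

def regulatory_matrix(env, stable_state, V, M):
    """Build R (k x k) by perturbing each gene i with value V[i]; M[j] is the maximum observed level of gene j."""
    k = len(stable_state)
    R = np.zeros((k, k))
    for i in range(k):
        s_initial, s_final = perturbation_experiment(env, stable_state, i, V[i])
        R[i] = np.abs((s_final - s_initial) / np.asarray(M, dtype=float))  # regulatory values r_{i,j}, Eq. 10
    return R

def extract_grn(R, threshold=0.1):
    """Keep an arc i -> j whenever r_{i,j} exceeds the threshold (0.1 in the example of Fig. 6)."""
    return [(i, j) for i in range(R.shape[0]) for j in range(R.shape[1])
            if i != j and R[i, j] > threshold]
```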

4 Results

In this section, we discuss the results obtained on ten datasets from the DREAM4 Challenge, available in [23]. Each dataset consists of five or ten experiments, each with 21 observations of ten or one hundred genes recorded every fifty minutes. Since each dataset contains multiple experiments, the first half of the experiments were used as a training set, while the second half served as a validation and test set.

Fig. 6. Method for creating a gene regulatory network starting from the corresponding regulatory matrix with a threshold value of 0.1.

The obtained results are shown in Table 2. To examine the effectiveness of our methodology, two different groups of metrics were used. The similarity index was used to measure the reliability of the artificial environment, while performance metrics (accuracy, precision, sensitivity, and specificity) were used to compare the predicted gene regulatory network with the target network. In the field of gene regulation, the typical elements of the confusion matrix also have a biological meaning, as reported in [23]; a short computational sketch of these metrics follows the list below.

  • True Positive (TP) denotes the number of regulatory mechanisms correctly predicted by our approach.

  • True Negative (TN) represents the number of arcs that are present in neither the predicted GRN nor the target GRN.

  • False Positive (FP) denotes the number of regulatory mechanisms predicted by our approach that are incorrect.

  • False Negative (FN) denotes the number of regulatory mechanisms not detected by our approach.
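
As a small sketch, these quantities and the derived metrics can be computed from boolean adjacency matrices of the predicted and target networks (the adjacency-matrix encoding is our assumption):

```python
import numpy as np

def grn_metrics(predicted, target):
    """Accuracy, precision, sensitivity and specificity from boolean k x k adjacency matrices."""
    predicted, target = np.asarray(predicted, bool), np.asarray(target, bool)
    tp = np.sum(predicted & target)      # regulatory mechanisms correctly predicted
    tn = np.sum(~predicted & ~target)    # arcs absent from both networks
    fp = np.sum(predicted & ~target)     # predicted mechanisms that are not in the target GRN
    fn = np.sum(~predicted & target)     # target mechanisms our approach missed
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```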

As mentioned in the previous section, a threshold needs to be defined to generate a gene regulatory network from the regulatory matrix. In our experiments, we tested several threshold values ranging from 0 to 1 in steps of 0.01. However, due to space limitations, we only report in the results table the experiments with the highest accuracy and precision for each dataset.

Table 2. The results table shows the metrics used to validate our methodology. \(Q_{\%}\) is the similarity index, Thr. stands for threshold, Acc. for accuracy, Prec. for precision, Sens. for sensitivity, and Spec. for specificity.

As can be observed, the higher the similarity index, the better the accuracy and precision of the generated gene regulatory network. Conversely, the precision decreases when the similarity index is lower, although the accuracy of the predicted gene regulatory network remains quite good. Another aspect that needs to be examined is the number of false negatives. When the number of genes increases, the number of regulatory mechanisms discovered is lower than expected; conversely, when the number of genes is lower, the number of false negatives is acceptable and the overall sensitivity is higher. Similarly, false positives correspond to regulatory mechanisms added by our method but not included in the target regulatory network: the resulting gene regulatory network will therefore have more arcs than expected when false positives are high, and this affects the specificity.

5 Conclusions

In this paper, we presented a new method for the gene network inference problem. Unlike other methods already present in the literature, such as Boolean networks [18], our technique allows us not only to perform experiments with our artificial environment, but also to determine the actual interactions between genes with good accuracy and precision.

In fact, the ability to simulate how the expression levels of a collection of genes change over time is one of the most interesting features of our approach. However, as mentioned above, we are currently unable to demonstrate this part of the work, as all our experiments would need to be validated in the laboratory.

Another interesting direction for future work is to compare our method, which is based on an artificial environment, with other existing techniques that handle instances with more than a thousand genes.

We are also aware that the threshold defined in Sect. 3.2 was not computed by any principled method and could therefore be an obstacle. However, we are already working on a solution that consists of using a probabilistic algorithm to define the threshold directly.

Finally, we would like to thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.