Keywords

1 Introduction

Computational modelling in the sciences is founded on the notion of abstraction, the process of identifying and representing mathematically the salient features and interactions of a real system. Abstraction is a human led and interdisciplinary activity: many factors influence the decision of which features/interactions are eventually represented in the abstracted model, including the specialist interests of the scientists formulating the model, as well as computational constraints on the level of detail which can feasibly be implemented on the available hardware. Such factors inevitably vary between different research groups and at different times, leading to a proliferation of different models representing the same underlying phenomena.

Systems biology is a prominent field where models with different level of abstraction coexist. As an example, consider the process of gene expression, whereby genetic material stored in the DNA is transcribed into messenger RNA and eventually translated into protein. At the highest level of abstraction, which is frequently employed when studying high-throughput data, the process may be considered as a black box, and one may only be interested in characterising the statistical structures underlying observed levels of gene expression in different conditions [8]. Alternatively, one may want to mechanistically represent the dynamics of some of the steps involved in the process. The choice of which steps to include is again largely driven by extrinsic factors: examples in the literature range from highly detailed models where the synthesis of mRNA/protein is modelled through the binding and elongation of polymerase/ribosomes, to models where protein production is modelled as a single reaction with first order kinetics [1, 10].

Representing the same physical process by multiple models at different levels of abstraction immediately engenders the question of how different model outputs can be reconciled. As the models all represent the same underlying physical system, it can be plausibly assumed that such models will agree at least qualitatively for suitable parametrisations. In general, however, models may not agree quantitatively, and their discrepancy may be a function of the parameters. Understanding and quantifying such discrepancies would often be very valuable: first of all, it can shed light on how simplifications within models affect predictions, and secondly it may open the opportunity to construct computationally smaller surrogates of complex models. Such surrogates can be precious when modelling requires intrinsically computationally intensive tasks like inference, as they have less parameters.

In this paper, we approach the problem of reconciling models from a statistical machine learning angle. We start by sampling a sparse subset of the parameter space over which we evaluate the models’ outputs (generally by simulation). These evaluations are used as a training set to learn a correction map via a non-parametric regression approach based on Gaussian Processes. We show that our approach yields a consistent stochastic equivalence between models, which provably reconciles the predictions of the two models up to the second moment. We demonstrate the approach on two biological examples, showing that it can lead to non-trivial insights into the structure of the models, and provide an efficient way to simulate a complex model via a simpler model. Correction maps could be used to reduce the complexity of some common problems that we briefly discuss in the paper, e.g., model selection, synthesis and parameter estimation.

The rest of the paper is organised as follows: we start by giving a high level description of the problem and how we attack it. This is followed by a formal definition and a thorough illustration of the proposed solution, discussing its desirable theoretical properties. We then demonstrate the approach on two proof of principle examples, showing the potential for practical application of the method. We conclude the paper by discussing the relationship of our approach to existing ideas in model reduction and statistical emulation, as well as some possible extensions of our results.

2 Problem Definition

High level description of the approach. In this paper we consider the problem of performing analyses which require exhaustive sampling of a model’s outputs, \(\mathsf{{M}}\), the dynamics of which are expensive to compute. We are not interested in the origin of such a complexity, and we assume to be given an abstraction/surrogate of this model, \(\mathsf{{m}}\), which is reasonably less challenging to analyzeFootnote 1. For this reason, we want to investigate a general methodology to use \(\mathsf{{m}}\) as a reliable proxy to get statistics over \(\mathsf{{M}}\). Possible applications of this framework, which we discuss in Sect. 4, regard model selection and synthesis, inference, and process equivalence/control.

In general, we will assume both models to be stochastic processes, e.g. continuous time Markov Chains (CTMCs). Furthermore, we assume that the highly detailed model \(\mathsf{{M}}\) and the reduced model \(\mathsf{{m}}\) share some parameters \(\varvec{\theta }_\mathsf{{m}}\) and some observables/state variables \(\mathsf{{X}}\), but the large model will generally have many more state variables and parameters. In general we can compute some statistics of the shared state variables \(\mathsf{{X}}\) (e.g. mean), and that such computation will be considerably more expensive using the detailed model \(\mathsf{{M}}\).

As both models are devised as abstractions of the same physical system, it is not unreasonable to assume that the expected values of the shared variables will be similar for some parametrisations of the models. However, it is in general not the case that the distribution of the shared variables implied by the two models will be the same, and, as we vary the shared parameters \(\varvec{\theta }_\mathsf{{m}}\), even the expected values may show non-negligible discrepancies. Our aim is to develop a generally applicable machine learning approach to correct the output of the reduced model, in order to match the distribution of the larger model. This has strong practical motivations: one of the primary uses of models in general is to test hypotheses statistically against observed data, and it is therefore essential to capture as accurately as possible the implied variability on observable variables.

The strategy we will follow is simple: we start by sampling values of the shared parameters \(\varvec{\theta }_\mathsf{{m}}\), and compute the first two statistics of the observed variables as implied by both the large and reduced models (by simulation). In general, one may expect the variance implied by the larger model to be larger, as a more detailed model will imply more stochastic steps. We can therefore correct the first two statistics (mean and variance) of the reduced model output by adding a random function of the shared parameters \( \varvec{\theta }_{\mathsf{{m}}}\), which can be learned by rephrasing it as a regression task. We will work with heteroschedastic Gaussian Processes [5].

2.1 Formal Problem Statement

We assume to be given two models of the same underlying physical system:

  • a highly detailed model \(\mathsf{{M}}\), with state variables \(\mathcal {Y}\) and parameters \(\varvec{\theta }_\mathsf{{M}}\).

  • a reduced model \(\mathsf{{m}}\), with state variables \(\mathcal {X}\) and parameters \(\varvec{\theta }_\mathsf{{m}}\).

We will have \(|\mathcal {Y}| \gg |\mathcal {X}|\) and \(|\varvec{\theta }_\mathsf{{M}}| \gg | \varvec{\theta }_\mathsf{{m}}|\).

Assumptions. In general, the problem we want to tackle draws immediate connection to that of using \(\mathsf{{m}}\) as a statistical emulation of \(\mathsf{{M}}\). However, to exploit solutions from regression analysis and machine learning, we make further assumptions and discuss their limitations thorough the paper.

  1. 1.

    (Observables) we assume that it exists a non-empty set of state variables (or, more generally, observables) \({\mathsf{{X}}}\), common to both models, that is sufficient to compute our statistics of interest.

  2. 2.

    (Parameters) we assume that model \(\mathsf{{M}}\) is fully parametrized by \(\varvec{\theta }_\mathsf{{M}}\), a vector of real-valued parameters that breaks down as \(\varvec{\theta }_\mathsf{{M}}= [ \varvec{\theta }_\mathsf{{m}}\quad \varvec{\theta }_\mathsf{{f}}]^\top \), with \(\varvec{\theta }_\mathsf{{m}}\) shared between models \(\mathsf{{m}}\) and \(\mathsf{{M}}\). Here, we assume that \(\mathsf{{m}}\) is fully parametrizedFootnote 2 by \(\varvec{\theta }_{\mathsf{{m}}}\), which ranges in a domain set \(\varTheta \). We term \(\varvec{\theta }_\mathsf{{f}}\) free parameters in \(\mathsf{{M}}\), given \(\varvec{\theta }_{\mathsf{{m}}}\). We further assume to have a probability density \(p(\varvec{\theta }_\mathsf{{f}})\) for the free parameters, and a probability density \(p(\varvec{\theta }_\mathsf{{m}})\) for the shared parameters, encoding our knowledge about them.

  3. 3.

    (Sampling) we assume that, for every parametrization, each model’s dynamics is computable, i.e. it can be simulated.

Fig. 1.
figure 1

Correction maps as regression problems. From sampled estimates of a statistics \({\hat{\eta }}\) computed over domain \(\varTheta \) by models \(\mathsf{{M}}\) and \(\mathsf{{m}}\), we can define a correction \({\hat{\mathsf{M}}}_{{\hat{\eta }}}(x) \triangleq \mathsf{{m}}_{{\hat{\eta }}}(x) + {{\mathcal{M}_{\hat{\eta }}}}(x)\) from their difference. According to the variables/parameters involved, we collect multiple values of such correction; see also Fig. 2. Then, a regression over such values leads to a correction model for \(\mathsf{{M}}\)’s prediction, from \(\mathsf{{m}}\)’s ones; this differs from standard emulation as we retain the mechanistic description of the system in \(\mathsf{{m}}\). A map is modeled as a random function via Gaussian Processes (GPs) with heteroschedastic variance; GPs induce a distribution over functions (colored lines), and the map will be the expectation (red line). Maps allow to detect regions of \(\varTheta \) where the predictions of the models are in agreement, \({\mathcal{M}_{\hat{\eta }}}\rightarrow 0\), or roughly constant (i.e., at low sensitivity). (Color figure online)

In this work, we will consider correction maps conditioned to a reference statistic of interest, in the following sense.

Definition 1

(Statistic). A statistic \(\eta \) is any observable that we can compute from one, or from an ensemble of simulations of a model. We denote its estimator computed from model \(\mathsf {q}\) with parameters x as \(\mathsf {q}_{{\hat{\eta }}}(x)\), and its true value \(\mathsf {q}_{\eta }(x)\).

Valid examples of such statistics are, e.g., a single value or the expectation of a variable in \(\mathsf{{X}}\), the satisfiability of a temporal logical formula depending on variables \(\mathsf{{X}}\) that could be model-checked, etc. The richer the estimator, the higher number of samples are required for the estimator to converge to the true statistics. We will make use of estimators that are consistent in the following sense: \(\mathsf {q}_{{\hat{\eta }}}(x) \rightarrow \mathsf {q}_{\eta }(x)\) as the number of samples goes to infinity.

Definition 2

(Correction map). We define an \(\epsilon \)-correction map \({{\mathcal{M}_\eta }}: \varTheta \rightarrow \mathbb {R}^w\) for a statistics \(\eta \) to be a function that for any \(\varvec{\theta }_{\mathsf{{m}}}\in \varTheta \), satisfies

where \(\epsilon >0\) is the precision, and \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}})] = \int \mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}}; \varvec{\theta }_\mathsf{{f}})p(\varvec{\theta }_\mathsf{{f}})d\varvec{\theta }_\mathsf{{f}}\) is the expectation of the statistics computed from \(\mathsf{{M}}\), with respect to its free parameters \(\varvec{\theta }_\mathsf{{f}}\). \({\hat{\mathsf{M}}}\) is our corrected prediction of \(\mathsf{{M}}\).

Thus, \({\mathcal{M}_{\hat{\eta }}}\) can correct the outcomes of \(\eta \) assessed over \(\mathsf{{m}}\), \(\mathsf{{m}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}})\), to match (with a given precision) those that we would have by computing the statistics over \(\mathsf{{M}}\). Notice that the corrected outcome has no more dependence from the free parameters, as this is a correction in expectation with respect to \(\varvec{\theta }_\mathsf{{f}}\).

Notice that the correction is a w-dimensional vector, and hence \(\parallel \cdot \parallel _2\) is used as distance metric, and that the term \(\epsilon \) allows for tolerance in the correction’s precision. It is easy to define the optimal, zero-error, correction map:

$$\begin{aligned} {\mathcal{M}_\eta }^\star (\varvec{\theta }_{\mathsf{{m}}}) \triangleq \mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})] - \mathsf{{m}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}}) \, . \end{aligned}$$
(1)

However, the correction function \({\mathcal{M}_\eta }^\star (\varvec{\theta }_{\mathsf{{m}}})\) is impossible to compute exactly, as we cannot compute neither \(\mathsf{{M}}_{\eta }\) nor its marginalisation over \(\varvec{\theta }_\mathsf{{f}}\). Hence, we will learn an approximation of \({\mathcal{M}_\eta }^\star (\varvec{\theta }_{\mathsf{{m}}})\) trying to keep its error low so to satisfy Definition 2. We turn this problem into a regression task, as graphically explained in Figs. 1 and 2, and exploit Gaussian Processes.

3 Learning the Correction Map

In this section we will present our machine learning strategy in more detail. We consider the case of a scalar statistics, as w-dimensional ones can be treated by solving w independent learning problems.

3.1 Marginalising \(\theta _{\mathsf {f}}\)

In order to evaluate (2), we need to be able to compute or approximate the value \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})]\) with respect to the free parameters of \(\mathsf{{M}}\), for a any given \(\varvec{\theta }_\mathsf{{m}}\). As this integral cannot be treated analytically or numerically, due to the large dimensionality involved (the cost is exponential in \(|\varvec{\theta }_\mathsf{{f}}|\)), we will resort to statistical approximations. Before discussing them, let us comment on the distribution \(p(\varvec{\theta }_\mathsf{{f}})\), which is an input for our method. In particular, this distribution encodes our knowledge on the more plausible values of the free parameters. In case we have no information, we can choose an uniform distribution. On the other side of the spectrum, we may know the true value \(\varvec{\theta }_\mathsf{{f}}^*\) of \(\varvec{\theta }_\mathsf{{f}}\), and choose a delta Dirac distribution, which will dramatically simplify the evaluation of the integral. In this case, we can evaluate (2) as

$$\begin{aligned} {\mathcal{M}_\eta }(\varvec{\theta }_{\mathsf{{m}}}) \triangleq \mathsf{{M}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}}; \varvec{\theta }_\mathsf{{f}}^*) - \mathsf{{m}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}}) \, , \end{aligned}$$
(2)
Fig. 2.
figure 2

For any pair of \(\varvec{\theta }_{\mathsf{{m}}}\) and \(\varvec{\theta }_\mathsf{{f}}\) we can compute the statistics for \(\mathsf{{M}}\), \(\mathsf{{M}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}}; \varvec{\theta }_\mathsf{{f}})\). Since we do regression over \(\varvec{\theta }_{\mathsf{{m}}}\), we model the relation between such values and \(\varvec{\theta }_\mathsf{{f}}\) as the variance of a random variable, indexed by \(\varvec{\theta }_{\mathsf{{m}}}\), whose samples are the values, as a function of \(\varvec{\theta }_\mathsf{{f}}\). Marginalization is the exponential strategy that estimates such variance correctly; all downsampling strategies possibly over or under fitting. Accounting for the relation between this variance and \(\varvec{\theta }_{\mathsf{{m}}}\) can be achieved by heteroschedastic regression.

Moreover, the approximation, \(\int \mathsf{{M}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}}; \varvec{\theta }_\mathsf{{f}})p(\varvec{\theta }_\mathsf{{f}}) d\varvec{\theta }_\mathsf{{f}}\approx \mathsf{{M}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}};\varvec{\theta }_\mathsf{{f}}^*) \) is appropriate when the distribution \(p(\varvec{\theta }_\mathsf{{f}})\) is tightly concentrated around its mode \(\varvec{\theta }_\mathsf{{f}}^*\).

In general, however, when \(p(\varvec{\theta }_\mathsf{{f}})\) does not have this special form, we can resort to downsampling \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})]\), by generating k samples \({\varvec{\theta }_\mathsf{{f}}}^{(1)}\), \(\ldots \), \({\varvec{\theta }_\mathsf{{f}}}^{(k)}\) and approximating \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})] \approx \frac{1}{k} \sum _j \mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}};{\varvec{\theta }_\mathsf{{f}}}^{(j)})\). In the following, however, we will not necessarily aggregate the values \(\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}};{\varvec{\theta }_\mathsf{{f}}}^{(j)})\), and treat them individually to better account for the variance in the observed predictions.

3.2 Gaussian Processes

We will solve the learning problem resorting to Gaussian Process (GP) regression [12]. GPs are random functions, i.e. probability distributions over a function space, in our case functions \(f:\varTheta \rightarrow \mathbb {R}\), with the property that any finite dimensional projection \(f(\varvec{\theta }_1)\), \(\ldots \), \(f(\varvec{\theta }_k)\) is a multidimensional Gaussian random variable. It follows that GP are defined by a mean function \(\mu (\varvec{\theta })\), returning the mean at any point in \(\varTheta \), and by a covariance or kernel function \(k(\varvec{\theta }_1,\varvec{\theta }_2)\), for giving the covariance between any pair of points in \(\varTheta \). GP can be used to solve regression tasks in a Bayesian setting. The idea is as follows: we put a GP prior on the space of functions \(\{f~|~f:\varTheta \rightarrow \mathbb {R}\}\), typically by assuming constant zero mean and fixing a kernel functionFootnote 3, and then consider the posterior distribution given a set of observations \(Y= \{ y^{(i)}\}_I\), \(i \in I\), at input points \(X = \{\varvec{\theta }_{\mathsf{{m}}}^{(i)}\}\). If we assume that \(y^{(i)} = f(\varvec{\theta }_{\mathsf{{m}}}^{(i)}) + \epsilon _i\), with \(\epsilon _i\) a zero mean Gaussian noise term with variance \(\sigma ^2\), then we obtain that the posterior distribution given the observed data is still a GP, with mean and kernel functions that can be obtained analytically, see [12] for further details. GP regression is essentially the same if the observation noise \(\sigma ^2\) is different at different input points, i.e. \(\sigma ^2 = \sigma ({\varvec{\theta }_{\mathsf{{m}}}^{(i)}} )^2\), in which case we talk about heteroschedastic regression.

3.3 The Regression Task

Let \({\varvec{\theta }_{\mathsf{{m}}}^{(i)}}\) for some index set \(i \in I\) be the input space, and \(\{ \langle {\varvec{\theta }_{\mathsf{{m}}}^{(i)}}, y^{(i)}\rangle \}_I\) our training points. In case we use Eq. (3) to evaluate the correction map, each \(y^{(i)}\) is a scalar value, and a standard regression schema based on Gaussian Processes can be used. In that case we assume samples of the correction map y to be observations from a random variable centered at a value given by the latent function

$$\begin{aligned} y^{(i)} \sim \mathcal{N} ({\mathcal{M}_\eta }({\varvec{\theta }_{\mathsf{{m}}}^{(i)}}), \sigma ^2) \, . \end{aligned}$$
(3)

In this standard Gaussian Processes regression the noise model in the observations is assumed to be a constant \(\sigma ^2\) for all sampled points.

In the more general case we work with downsampling solutions that exploit k samples for the free variable, \({\varvec{\theta }_\mathsf{{f}}}^{(1)},\ldots , {\varvec{\theta }_\mathsf{{f}}}^{(k)}\). In that case, we have k correction values per training point, \( \Big \{ \langle {\varvec{\theta }_{\mathsf{{m}}}^{(i)}}, [y^{(i,1)} \cdots y^{(i,k)}]^\top \rangle \Big \}_I \), that we can use in a straightforward way to reduce to the above schema, or to estimate the variance of \(\mathsf{{M}}\) conditioned to its free variables. In these cases, the training set is \(\{ \langle {\varvec{\theta }_{\mathsf{{m}}}^{(i)}}, \overline{y}^{(i)} \rangle \}_I\), namely we do regression on the point-wise expectation of the correction (i.e., \(\overline{y}^{(i)}= \frac{1}{k}\sum _{j=1}^k y^{(i,j)}\)).

Estimator 1

(Empirical \(\overline{\sigma }\) -estimator). Set \(\overline{\sigma }\) to the empirical estimate of the variance across all correction values to exploit the schema in Eq. (4) with \(\sigma ^2 \triangleq \overline{\sigma }\).

Besides the simple first case, it is more interesting to account for a model of the variance in the observations of the predictions from a model; we discuss how this could be done in two ways.

Estimator 2

(Point-wise \(\sigma \)-estimator). Let \(\sigma (\cdot )\) the empirical estimator of the variance of the correction values, per training-point

$$\begin{aligned} \sigma ({\varvec{\theta }_{\mathsf{{m}}}^{(i)}}) \triangleq \mathsf{Var}[y^{(i,1)}, \ldots , y^{(k,1)}] \, , \end{aligned}$$
(4)

then define a model of the variance as a point-wise function of the regression parameter, that is

$$\begin{aligned} y^{(i)} \sim \mathcal{N}\Big ({\mathcal{M}_\eta }({\varvec{\theta }_{\mathsf{{m}}}^{(i)}}), \sigma ({\varvec{\theta }_{\mathsf{{m}}}^{(i)}})^2\Big ) \, . \end{aligned}$$
(5)

In this case, the variance that we expect in each prediction of the latent function is estimated from the data, thus leading to a form of heteroscedastic regression.

We can estimate with higher precision a model of the variation in the variance across the input space; to do that we perform regression of the variance change, and then inform the outer regression task of that prediction.

Estimator 3

(Nested \(\sigma \) -estimator). Consider the same estimator of the variance as above, and collect the variance estimates as \(\{ \langle {\varvec{\theta }_{\mathsf{{m}}}^{(i)}}, w^{(i)} \rangle \}_I\). Learn a latent function model of the true variance \(\sigma _\star (\cdot )\), which we assume to have a fixed variance \(\sigma _\star ^2\)

$$\begin{aligned} w^{(i)} \sim \mathcal{N}\Big ( \sigma _\star ({\varvec{\theta }_{\mathsf{{m}}}^{(i)}}), \sigma _\star ^2 \Big )&\qquad y^{(i)} \sim \mathcal{N}\Big ({\mathcal{M}_\eta }({\varvec{\theta }_{\mathsf{{m}}}^{(i)}}), \sigma _\star ({\varvec{\theta }_{\mathsf{{m}}}^{(i)}})^2\Big ) \, . \end{aligned}$$
(6)

This is also a form of heteroschedastic regression, but nesting two GP regressions to account in a finer way for the variance’s changes.

3.4 Properties of the Correction Map

The correction map that we learn, for all variance schemes, is a statistically sound estimator of \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})]\), in the sense of being consistent.

Theorem 1

(Correctness). Let \(\mathsf{{m}}_{{\hat{\eta }}}(\varvec{\theta }_{\mathsf{{m}}})\) and \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{{\hat{\eta }}}(\varvec{\theta }_\mathsf{{m}})]\) be consistent estimators of \(\mathsf{{m}}_{\eta }(\varvec{\theta }_{\mathsf{{m}}})\) and \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})]\), then \({\hat{\mathsf{M}}}_{{\hat{\eta }}}(\varvec{\theta }_{\mathsf{{m}}}) \triangleq \mathsf{{m}}_{{\hat{\eta }}}(\varvec{\theta }_{\mathsf{{m}}}) + {{\mathcal{M}_{\hat{\eta }}}}(\varvec{\theta }_{\mathsf{{m}}})\) is a consistent estimator of \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{\eta }(\varvec{\theta }_\mathsf{{m}})]\), for any estimation strategy of \({\mathcal{M}_{\hat{\eta }}}\).

The result follows because \({\mathcal{M}_{\hat{\eta }}}\) converges to \({\mathcal{M}_\eta }\) due to properties of GPs [12], and because of the consistency of \(\mathsf{{m}}_{{\hat{\eta }}}\) and \(\mathbb {E}_{\varvec{\theta }_\mathsf{{f}}}[\mathsf{{M}}_{{\hat{\eta }}}(\varvec{\theta }_\mathsf{{m}})]\). The proof is sketched in Appendix A.2.

The correction map \({{\mathcal{M}_{\hat{\eta }}}}(\varvec{\theta }_{\mathsf{{m}}})\) is estimated from samples of the system, hence it is an approximation of the exact map defined by Eq. (2). Thus, it is a correction map in the sense of Definition 2. However, being the result of a GP regression, \({{\mathcal{M}_{\hat{\eta }}}}(\varvec{\theta }_{\mathsf{{m}}})\) is in fact a random function. Therefore, in measuring the error according to Eq. (1), we need to consider the average value of the random function \(\mathbb {E}[{{\mathcal{M}_{\hat{\eta }}}}(\varvec{\theta }_{\mathsf{{m}}})]\). The variance \(\mathsf{Var}[{{\mathcal{M}_{\hat{\eta }}}}(\varvec{\theta }_{\mathsf{{m}}})]\), instead, provides a way of computing the error \(\epsilon \), but in a statistical sense with confidence \(\alpha \): \(\epsilon \) can be estimated by averaging over \(\varTheta \) (with respect to \(p(\varvec{\theta }_\mathsf{{m}})\)) the half-width of the region containing \(\alpha \)% of the probability mass of \({{\mathcal{M}_{\hat{\eta }}}}(\varvec{\theta }_{\mathsf{{m}}})\).

The cost of all our approaches inherently depends on how many samples we pick from \(\varTheta \), the way parameters in \(\varvec{\theta }_\mathsf{{f}}\) are accounted for and the number of parameters in \(\varvec{\theta }_\mathsf{{m}}\). The sampling cost in general grows exponentially with \(|\varvec{\theta }_\mathsf{{m}}|\), and each Gaussian regression is cubic in the number of sampled values. Notice that, asymptotically, the cost of the empirical and nested \(\sigma \)-estimators is the same, as the two regressions are executed in series.

4 Applications

We discuss now several potential applications of our framework. The advantages of using our approach are mostly computational: the reduced model is simpler to analyze, yet it retains a mechanistic interpretation, compared to statistical emulation.

Model Building. Many common problems in the area of dynamical systems can be tackled by resorting to correction maps.

Problem 1

(Model selection). Let \(\mathsf{{M}}\) be a model, and \(\mathsf{{m}}_1\), \(\ldots \), \(\mathsf{{m}}_k\) a set of candidate models for correction, each one having a correction map \(\mathcal{M}_{\eta ,1}\), \(\ldots \), \(\mathcal{M}_{\eta ,k}\). The smallest-correction model \(\mathsf{{m}}^*\) for a statistic \(\eta \) is \( \mathsf{{m}}^*\triangleq \arg \min _{\mathsf{{m}}_i} \int \mathcal{M}_{\eta ,i}(\varvec{\theta }) d\varvec{\theta }\).

This criterion is certainly alternative to structural Bayesian approaches [3], which can be used to select the structurally smaller model within an equivalence class (see below). Also, allows to frame a model synthesis problem.

Problem 2

(Model synthesis). For a model \(\mathsf{{M}}\) with parameters \(\varvec{\theta }_\mathsf{{M}}\) and for a statistic \(\eta \): (i) partition \(\varvec{\theta }_\mathsf{{M}}\) into sets \(\varvec{\theta }_\mathsf{{m}}\) and \(\varvec{\theta }_\mathsf{{f}}\), (ii) generate a finite set of plausible reduced models with parameters \(\varvec{\theta }_\mathsf{{m}}\) and (iii) select the best one, according to the above model selection criterion.

In this case, the partition might aim at identifying in \(\varvec{\theta }_\mathsf{{M}}\) the model’s parameters which contribute the most to the variance for the statistics \(\eta \). Opportunities for control are also plausible if one can reduce the problem of controlling \(\mathsf{{M}}\) to “controlling and correcting” a reduced model \(\mathsf{{m}}\). This should be easier as \(\mathsf{{m}}\) is structurally smaller than \(\mathsf{{M}}\), and so is lower the complexity of solving a controller synthesis problem.

Parameter Estimation. Another application of our framework is in Bayesian parameter estimation of the parameters \(\varvec{\theta }_\mathsf{{m}}\) of the big model \(\mathsf{{M}}\), given observations D of variables \(\mathsf{{X}}\), using the small model and the corrected statistics to build approximations of the likelihood function \(p(D\mid \varvec{\theta }_\mathsf{{m}})\). For instance, this can be done by correcting the mean of variables \(\mathsf{{X}}\), and using the variance of the correction map as a proxy of the variance \(\mathsf{{X}}\) in \(\mathsf{{M}}\) after marginalisation of \(\varvec{\theta }_\mathsf{{f}}\). We can then plug the distribution of \(\mathsf{{X}}\) in a Bayesian inference scheme, and compute an approximate posterior over \(\varvec{\theta }_\mathsf{{m}}\).

Model Equivalence. Correction maps can also be used to compare processes, via weak forms of bisimilarity with respect to the observations and a statistics.

Definition 3

(Model equivalence). Models \(\mathsf{{M}}_1\) and \(\mathsf{{M}}_2\), for a statistic \(\eta \) and parameter sets \(\varvec{\theta }_\mathsf{{m}}\) and \(\varvec{\theta }_\mathsf{{f}}\), are \(\eta \)-bisimilar conditioned to \(\mathsf{{m}}\), \(\mathsf{{M}}_1\equiv _{\mathsf{{m}}}^{\eta } \mathsf{{M}}_2\), if and only if for all \(\varvec{\theta }_\mathsf{{m}}\in \varTheta \), it holds \(\mathcal{M}_{\eta ,1}(\varvec{\theta }_\mathsf{{m}}) = \mathcal{M}_{\eta ,2}(\varvec{\theta }_\mathsf{{m}})\). A class of equivalence of models with respect to \(\mathsf{{m}}\) and \(\eta \) is the set of all such bisimilar models.

This notion of bisimilarity resembles conditional dependence, as we are saying that two models are equivalent if an observer corrects both the same way. In this case, \(\mathsf{{m}}\) is a universal corrector of \(\equiv _{\mathsf{{m}}}^{\eta }\), as it can correct for all the models in the class. The class of models that are equivalent to a model \(\mathsf{{M}}\) is \(\{ \mathsf{{M}}^*\mid \mathsf{{M}}^*\equiv _{\mathsf{{M}}}^{\eta } \mathsf{{M}}\}\) – i.e., the class of models with correction zero; notice that in this case \(\varvec{\theta }_\mathsf{{f}}=\emptyset \). The previous definition can be relaxed by asking that \(|\mathcal{M}_{\eta ,1}(\varvec{\theta }_\mathsf{{m}}) - \mathcal{M}_{\eta ,2}(\varvec{\theta }_\mathsf{{m}})|\le \epsilon \), for all \(\varvec{\theta }_\mathsf{{m}}\in \varTheta \).

Remark 1

Criterion \(\equiv _{\mathsf{{m}}}^{\eta }\) is a weaker form of probabilistic bisimilarity, namely if \(\mathsf{{M}}_1 \equiv \mathsf{{M}}_2\) are bisimilar, then \(\mathsf{{M}}_1 \equiv _{\mathsf{{m}}}^{\eta } \mathsf{{M}}_2\) for some \(\mathsf{{m}}\) and any statistics of interest. For instance, for CTMCs, this follows because \(\equiv \) implies that \(\mathsf{{M}}_1\) and \(\mathsf{{M}}_2\) have the same infinitesimal generators for any parameter \(\varvec{\theta }_\mathsf{{m}}\) and \(\varvec{\theta }_\mathsf{{f}}\), hence the outcomes of \(\mathsf{{M}}_1\) and \(\mathsf{{M}}_2\) are indistinguishable, and so are their statistics.

Fig. 3.
figure 3

Example models tested in this paper. Top panel: the Henri-Michaelis-Menten model, where \(\mathsf{{m}}\) is derived when \({\mathbf{C}}: [\mathsf{{E}}{}]_0 + [{\mathsf{{ES}}} {}]_0 \ll [{\mathsf{{S}}}{}]_0 + K_\mathsf{MM}\). Bottom panel, a protein translation network where \(\mathsf{{m}}\) when \({\mathbf{C}}: \beta \gg \alpha \).

5 Examples

We investigate two examples to better illustrate our method.

5.1 Model Reduction via QSSA

Consider the irreversible canonical enzyme reaction with its exact representation (here, model \(\mathsf{{M}}\)), for enzyme E, complex ES , substrate S and product PR (Fig. 3, top left panel). When the concentration of the intermediate complex does not change on the time-scale of product formation, product is produced at rate \(f \triangleq {V_\mathsf{M}{\mathsf{{S}}}{}}/(K_\mathsf{M} + {\mathsf{{S}}}{})\) where \(V_\mathsf{M} = k_2([\mathsf{{E}}{}]_0 + [{\mathsf{{ES}}} {}]_0) \) and \(K_\mathsf{MM}=({k_{-1} + k_2})/{k_1}\). This is the Henri-Michaelis-Menten kinetics and is technically derived by a quasi-steady-state assumption (QSSA), i.e., \(\dot{{{\mathsf{{ES}}} {}}}= \dot{{\mathsf{{E}}{}}}=0\), that is in place when \({\mathbf{C}}: [\mathsf{{E}}{}]_0 + [{\mathsf{{ES}}} {}]_0 \ll [{\mathsf{{S}}}{}]_0 + K_\mathsf{MM}\), where \([x]_0\) is the initial amount of species x. \(\mathsf{{m}}\) is thus the QSSA reduced model (Fig. 3, top right panel).

Fig. 4.
figure 4

Correction of the product formation at the transient time \(t_*=1.5\), for a mean field model of irreversible canonical enzyme reaction and its simplified Henri-Michaelis-Menten form. Here \(k_1 = 2\), \(k_{-1}=1\) and \(k_2 = 1.5\), \([S]_0=60\) and \([P]_0=0\). Regression over \([\mathsf{{E}}{}]_0\) is done with 40 training points from (0, 100], and the correction in Eq. (3) as \(\mathsf{{M}}\)’s free variables are part of the Michaelis-Menten constant.

We interpret these two models as two systems of ordinary differential equations, see Appendix A.1, and learn a correction for the following statistics

$$\begin{aligned} \eta&: \mathbb {E}[{\mathsf{{PR}}}(t_*)] \text {, with } {\mathsf{{PR}}}(t_*) \text { the number of products at time } t_*\end{aligned}$$
(7)

For non-degenerate parameter values both models predict the same equilibrium state, where a total transformation of substrate into product has happened, \(\mathbb {E}[{\mathsf{{PR}}}(t)] \rightarrow [{\mathsf{{S}}}{}]_0\) for large t. Thus, we are not interested in correcting the dynamics of \(\mathsf{{m}}\) for long-run times, but rather in the transient (i.e., for small \(t_*\)).

Also, as the QSSA hypothesis does not hold for certain initial conditions, we set \(\varvec{\theta }_\mathsf{{m}}= \{ [\mathsf{{E}}{}]_0\}\) as the regression variable, and set \([S]_0=60\) and \([P]_0=0\). The other parameters are \(k_1 = 2\), \(k_{-1}=1\) and \(k_2 = 1.5\) with unit (mole per second) \(^{-1}\). In terms of regression, we pick 40 samples of the initial enzyme amount from (0, 100], and set \(t_*= 1.5\) as it is a time in the transient (manual tuning). This particular example is rather simple, as the free parameters of \(\mathsf{{M}}\) are part of the Michaelis-Menten constant and fixed, so we use the simpler correction of Eq. (3). Also, knowing when the QSSA holds gives us an interval where we expect the correction to shrink towards zero. The map is shown in Fig. 4, which depicts the expected concordance among the correction map and validity of the QSSA.

5.2 Model Reduction via Time-Scale Separation

Consider a gene switching among active and inactive states of mRNA transcription to be ruled by a telegraph process with rates \(k_{\mathsf{off}}\)/\(k_{\mathsf{on}}\). A reaction model of such gene \(\mathsf{{G}}{}\), protein \({\mathsf{{PR}}}{}\), messenger mRNA with transcription/translation rates \(\alpha \)/\(\beta \) as in Fig. 3, bottom left panel.

Fig. 5.
figure 5

Comparison between the dynamics of the full and the reduced models from Sect. 5.2, with \(k_{\mathsf{off}} = k_{\mathsf{on}} = \delta _{\mathsf{P}} = 10^{-2}\), \(\delta _{\mathsf{RNA}} =10\), \([\mathsf{{G}_{\mathsf{act}}}]_0=1\). Values for transcription (\(\alpha \)) and translation (\(\beta \)) are reported in the figure. The reduced model predicts spiked dynamics, leading to a unimodal distribution of proteins. The larger model, instead, can either predict protein buffers, when there is no time scale separation (\(\alpha =\beta \)), or multiple equilibria, leading to a multimodal distribution of protein counts. Observe that the expectation on the number of proteins in the long run spans over different order of magnitudes, according to the relation between \(\alpha \) and \(\beta \).

Here the gene switches among active and inactive states, with rates \(k_{\mathsf{on}}\) and \(k_{\mathsf{off}}\), and PR feedbacks positively on inactivation. Proteins and mRNAs are degraded with rates \({\delta _{\mathsf{P}}}\) and \({\delta _{\mathsf{RNA}}}\). In the uppermost part of the diagram species marked with a \(*\) symbol are not consumed by a reaction, i.e., mRNA transcription is \(\mathsf{{G}_{\mathsf{act}}}\rightarrow \mathsf{{G}_{\mathsf{act}}}+\mathsf{{mRNA}}{}\). This model can be easily encoded in a Markov process, as discussed in A.1.

We can derive an approximation to \(\mathsf{{M}}\) where the transcription step is omitted. This is valid when \({\mathbf{C}}:\beta \gg \alpha \) (time-scales separation), namely for every copy of mRNA a protein is immediately translated.

Correction of protein dynamics. We build a correction map with \(\varvec{\theta }_\mathsf{{m}}=\{\beta \}\). In this case the telegraph process is common to both models, but \(\alpha \) and \(\delta _{\mathsf{RNA}}\) are free variables of \(\mathsf{{M}}\); here we assume to have a prior delta distribution over the values of mRNA’s degradation, so we set \(\varvec{\theta }_\mathsf{{f}}= \{\alpha \}\). For some values of the transcription rate condition C does not hold; in this case it is appropriate to account for \(\alpha \)’s contribution to the variance in the statistics that we want to correct, when we do a regression over \(\beta \). Note that also \(\beta \) is part of \({\mathbf{C}}{}\).

Model \(\mathsf{{M}}\) leads to stochastic bursts in PR’s expression when the baseline gene switching is slower than the characteristic times of translation/transcription. Here we set \(k_{\mathsf{off}} = k_{\mathsf{on}} = 10^{-2}\), and assume mRNA’s lifespan to be longer than protein’s ones (also in light of condition C), so \(\delta _{\mathsf{RNA}} =10 \delta _{\mathsf{P}} = 10^{-2}\). We simulate both models with one active gene, \([\mathsf{{G}_{\mathsf{act}}}]_0=1\); example dynamics are shown in Fig. 5, for \(\beta =100\) and \(\alpha \in \{1,100\}\). We observe that, when \({\mathbf{C}}{}\) does not hold (\(\alpha =\beta \)) the protein bursts increases of one order of magnitude, and the long-run probability density function for the proteins, \(p_{t}({\mathsf{{PR}}})\), becomes multimodal.

We define two statistics. One measures the first moment of the random variable that models the number of proteins in the long run; the other is a metric interval temporal logic formula [2], expressing the probability of a protein burst within the first 100 time units of simulation.

$$\begin{aligned} \eta _1&: \mathbb {E}[{\mathsf{{PR}}}(t_*)] \text {, with } {\mathsf{{PR}}}(t_*) \text { the number of proteins at time } t_*\gg 0 \end{aligned}$$
(8)
$$\begin{aligned} \eta _2&: \mathbb {E}[p(\varphi )] \text {, with } \varphi \triangleq \mathbf {F}_{[0,100]} {\mathsf{{PR}}}(t) > 200 \end{aligned}$$
(9)

The former is evaluated by a unique long-run simulation of the model, as its dynamics are ergodic. For the latter we estimate the satisfaction probability of the formula via multiple ensembles, as in a parametric statistical model checking problem [4].

Fig. 6.
figure 6

For the regression task we sample 50 values of \(\beta \) from [0.1, 100]. For each value of \(\beta \), we sample 50 random values of \(\alpha \) in the same interval. For each pair \((\alpha , \beta )\) we estimate the statistics for both models (green, \(\mathsf{{M}}\); red, \(\mathsf{{m}}\)), and obtain 50 correction values indexed by each \(\beta \) (green). Correction values (blue and pink) are the expectation and variance of the difference between \(\mathsf{{M}}\)’s and \(\mathsf{{m}}\)’s predictions. We transform them logarithmically before doing regression; observe that, on the linear scale, the correction is of the order of \(10^3\) with variance \(10^7\) (midpoint value \(\beta \approx 50\)). Gaussian Process regression (right panel) is performed with a constant \(\sigma ^2=0.2\), Eq. (4) and with the \(\sigma \)-estimator, Eq. (6). Values are re-scaled linearly, and 95 % log-normal confidence intervals are shown; regression highlights that the posterior variances are similar, but the fixed-variance schema tends to underfit or overfit the heteroscedastic variance (assumed it to be closer to the true one). (Color figure online)

For the regression task we sample 50 values of \(\beta \), in the range [0.1, 100]. For \(\alpha \), instead, we sample 50 random values in the same interval, for each value of \(\beta \); notice that with high probability we pick values where \({\mathbf{C}}{}\) does not hold, so we might expect high correction factors. Data generated and the regression results are shown in Fig. 6, for both fixed-variance regression, empirical \(\sigma \)-estimator in Eq. (4) and with the \(\sigma \)-estimator, Eq. (6). Because variance spans over many orders of magnitude, regression is performed in the logarithmic space. Results highlight a general difference between the posterior variance between the two estimators.

Fig. 7.
figure 7

Correction map for the expected satisfaction probability of the linear logic formula \(\eta _2\) in Eq. (9). Comparison between the point-wise \(\sigma \)-estimator and the empirical \(\overline{\sigma }\)-estimator. Sampled data are shown in Appendix Fig. 8.

For the second statistics, data is generated from an initial condition where one gene is inactive, \([\mathsf{{G}_{\mathsf{in}}}]_0=1\). Notice that the expected time for the gene to trigger its activation is \(1/k_{\mathsf{on}} = 100\) (the time upper-bound of the formula), so for some parametrization there will be non-negligible probability of having no protein spike above threshold 200. The formula satisfaction probability is evaluated with 100 independent model runs, and data generated are shown in Appendix Fig. 8. Regression results are shown in Fig. 7, where the point-wise \(\sigma \)-estimator and the empirical \(\overline{\sigma }\)-estimator are compared, highlighting the more robustness of the former with respect to the sampled bottom-left outlier.

6 Conclusions

Abstraction represents a fundamental tool in the armoury of the computational modeller. It is ubiquitously used in biology as an effective mean to reduce complexity, however systematic analysis tools to quantify discrepancies introduced by abstraction are lacking. Prominent examples, beyond the ones already mentioned, include models with delays, generally introduce to approximately lump multiple biochemical steps [6], or physiological models of organ function which adopt multi-scale abstractions to model phenomena at different organisational levels. These include some of the most famous success stories of systems biology, such as the heart model of [11], which also constitutes perhaps the most notable example of a physical systems which has been modelled multiple times at different levels of abstraction. Employing our techniques to clarify the quantitative relationship between models of cardiac electrophysiology would be a natural and very interesting next step.

Our approach has deep roots in the statistical literature. In this paper we have focussed on the scenario where we try to reconcile the predictions of two separate models, however the complex model was simply used as a sample generator black box. As such, nothing would change if instead of a complex model we used a real system which can be interrogated as we vary some control parameters. Our approach would then reduce to fitting the simple model with a structured, parameter dependent error term. This is closely related with the use of Gaussian Processes in the geostatistics literature [7], where simple (generally linear) models are used to explain spatially distributed data, with the residual variance being accounted for by a spatially varying random function. Another important connection with the classical statistics literature is with the notion of emulation [9]. Emulation constructs a statistical model of the output of a complex computational model by interpolating sparse model results with a Gaussian Process. Our approach can be viewed as a partial emulation, whereby we are interested in retaining mechanistic detail for some aspects of the process, and emulate statistically the residual variability. In this light, our work represents a novel approach to bridge the divide between the model-reduction techniques of formal computational modelling and the statistical approximations to complex models.