
1 Introduction

Computer models have made impressive progress over the past few decades in various areas of biology, and there is an increasing demand for these tools in scenarios that require quantitative predictions, such as infectious diseases [1, 2], changes in species dynamics due to climate change [3, 4], and protein signaling in cells [5, 6]. A recent example is the pandemic caused by the coronavirus SARS-CoV-2, responsible for the disease COVID-19 [7, 8]. The rapid spread of the disease is problematic because of the risk of death, and also because most health systems are not prepared to receive so many sick people in hospitals [9]. This kind of emergency illustrates the importance of cooperation between researchers from different areas, who must build an understanding of the disease and propose strategies to slow its spread as fast as possible. Mathematical models can be very useful in situations like this, as long as they have the necessary ingredients to describe the basic aspects of the biological system of interest and are well calibrated with real data [10].

Before a model can be useful, it must go through a process of certification to guarantee its match with reality [11]. During this process, it is useful to understand how variations in a model input parameter affect its outcome (response) [12]. This knowledge allows one to detect which phenomena are more important in the real system [13, 14]. For instance, in the coronavirus example, the effect of the response time of local governments to the outbreak and of the various mitigation strategies has been evaluated [9].

In this sense, the mathematical model parameters are the “fundamental blocks” of the predictive tool, and understanding how each of these “pieces” affects the outcomes related to the biological system of interest is a key point for the proper use of this quantitative arsenal. Global sensitivity analysis can be very useful here, since it provides information about the dependence of the model outcome on each of its input parameters [15]. Following that idea, this chapter presents a tutorial that illustrates the process of global sensitivity analysis in biological systems via Sobol’ indices [16].

In Sect. 6.2 we present the setting for the mathematical model representation. The Sobol’ indices method is described in Sect. 6.3, where the ideas and goals of sensitivity analysis are also characterized. After that, Sect. 6.4 gives a brief overview of surrogate models as a computationally efficient strategy to approximate the Sobol’ indices. With the theory detailed, Sect. 6.5 presents the Sobol’ framework and develops three biological examples: the predator–prey model, an NF-\(\kappa \)B signaling pathway model, and the SIR epidemiological model. All of them can be reproduced by the reader using the codes available in a public repository.

2 Mathematical Modeling

Modeling is the process of creating a (simplified) representation of a real process [17]. The objective is not to be completely faithful to reality, but to capture its main characteristics or to facilitate the identification and understanding of its mechanisms and processes [10]. For that reason, several types of models can be built for a single reality. A good model allows many aspects of the problem to be analyzed in order to gain more knowledge about it. Among the modeling objectives, we can highlight the understanding of the mechanisms that govern the phenomena of interest, the prediction of the future or of some state that is currently unknown, and control, in which a system is constrained to produce a desirable condition [18]. Because of this, mathematical modeling is often required to deal with biological processes.

The process of modeling involves several paradigms and steps but, more objectively, any model is composed of three elements: (i) the input, which comes from previously available information about the system state; (ii) the output, which is the information about the real problem; and (iii) the model, which is the system representation that maps the input to the output [19]. We can draw a parallel with manufacturing, taking the raw material as the input, the processed product as the output, and the machine responsible for transforming one into the other as the model. Intuitively, just as a more sophisticated machine can create a more elaborate object, a more detailed model can provide a better representation of reality. However, including further details makes the representation more complex, and greater complexity implies greater difficulty in analyzing all aspects of interest. The hypotheses must therefore balance the most important elements of the phenomena with a reasonable way of representing how these elements relate to what is to be quantified or evaluated. The model itself is this relation [20]. For example, the evolution of an animal species is related to the food resources available but also to the topographic characteristics of the environment. Assuming the species has lived in an environment for a long time, we can neglect the latter and take into account only the former. Of course, the effect of this approximation must be tested to guarantee that the model is sufficiently compatible with reality. Additionally, the number of individuals of the species will increase (or decrease) over some unit of time. The change in a month is probably different from the change in a year. This time scale may or may not be relevant, depending on the process treated.

Technically, we represent the important input components of a real system in the model by parameters. In mathematical terms, these parameters are quantities whose values depend on specific aspects of the phenomenon and that control how the model produces the desired predictions [10]. Such a prediction is called a quantity of interest (QoI), or quantities of interest if there is more than one. So, we can think of the mathematical model as a “machine” that uses the parameters as ingredients to obtain the QoIs. Formally, the model is defined by a mathematical operator \(\mathcal {M}\), so that the QoIs are given by

$$\begin{aligned} \mathbf{y} = \mathcal {M}(\mathbf{x} ), \end{aligned}$$
(6.1)

where the vector \(\mathbf{x} \in \mathbb {R}^m\) contains all the model parameters and the n QoIs are collected in a vector \(\mathbf{y} \in \mathbb {R}^n\). The structure of a general computational model is illustrated in Fig. 6.1. If the model is time dependent, one can lump the QoI values at the instants of interest into a vector. To simplify the notation, the time dependence is omitted in the theoretical discussions.

Fig. 6.1 Schematic representation of a general computational model

3 Sensitivity Analysis

Sensitivity Analysis (SA) is a process in which the contribution of each input parameter of a mathematical model to its response is identified [21]. In particular, for nonlinear systems, the QoIs are affected not only by the individual parameters but also by the interactions between them. In this case, the joint effect of parameter variations must also be quantified. This feature makes SA difficult for high-dimensional systems, because several orders of interaction must be computed and their influences are, in general, not trivial. Another important issue in the modeling process is to take into account the underlying uncertainties. The lack of knowledge about the real system and the natural variability of some parameters create difficulties in any attempt to understand a phenomenon, which can be compensated for by the use of a stochastic model [14, 19]. This is the domain of investigation of Uncertainty Quantification (UQ) theory.

The UQ and SA areas have gained greater prominence in recent decades, as it was recognized that it is extremely important not only to deal with multiple sources of uncertainty in a mathematical model but, in some contexts, also to apply strategies to reduce their effects, especially if their presence can be catastrophic for the system [22]. An alternative approach is also possible, in which one tries to take advantage of the uncertainties to improve the performance of a given system. Although the UQ and SA literature is already considerable, it is common to find texts that confuse these two concepts [13]. The first deals with the quantitative evaluation of how the uncertainties in the input variables are transported to the model response, which is done through the so-called uncertainty propagation process. The second is related to the quantitative evaluation of how uncertainty in the input parameters contributes (individually or jointly) to changes in the model response. In UQ it is mandatory to prescribe a characterization of the uncertainties so that, after their propagation is calculated, one has a general overview of the uncertainty in the model’s response. From the SA perspective, the idea is to discover how the system response changes when a certain parameter is changed [23].

SA methods can be divided into two large groups: Local SA (LSA) methods and Global SA (GSA) methods [24]. From a general perspective, the difference between them is how the parameter space is explored in each case. Local methods are normally based on partial derivatives or gradients and are therefore entirely dependent on the point of the parameter space at which the model is evaluated. As a consequence, LSA results do not reflect the general dependence of the model outcome on the selected input parameter. Moreover, for nonlinear models, the interactions between parameters are very important for the response, and local analysis cannot capture those effects. In contrast, GSA methods rely on screening or variance decomposition to overcome the limitations of local analysis [23]. In this chapter our interest is in GSA, as a way to discover the most important parameters, that is, those that contribute the most to the model response (factor prioritization), and those that contribute very little and can potentially be fixed (factor fixing) [13].
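The point dependence of local measures can be seen in a minimal sketch. The toy model \(f(x_1,x_2) = x_1 x_2\) and the helper `local_sensitivity` are assumptions introduced only for illustration; they are not part of the framework used later in the chapter.

```python
def model(x1, x2):
    """Hypothetical nonlinear toy model with a pure interaction term."""
    return x1 * x2


def local_sensitivity(f, x, i, h=1e-6):
    """Central finite-difference estimate of df/dx_i at the point x."""
    up = list(x)
    down = list(x)
    up[i] += h
    down[i] -= h
    return (f(*up) - f(*down)) / (2.0 * h)


# Sensitivity to x1 evaluated at two different nominal points.
s_near_zero = local_sensitivity(model, [0.5, 0.0], i=0)
s_near_one = local_sensitivity(model, [0.5, 1.0], i=0)
print(s_near_zero, s_near_one)
```

The derivative with respect to \(x_1\) equals \(x_2\), so the same parameter appears irrelevant at one nominal point and influential at another. A global method averages over the whole parameter space instead of committing to one point.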

3.1 Sobol’ Indices

There are several methods to perform SA in the scientific literature [13, 14]. As stated in the previous section, here we prefer global methods. For the global sensitivity analysis framework to be robust and general, it makes sense to select a method that is simple to implement and use. The Sobol’ indices [16] are a variance-based method that has become very popular in the recent literature [12] after the dissemination of surrogate model ideas, and it will soon be clear why.

In this framework, the system is analyzed from a probabilistic perspective that considers the model input as a random vector \(\mathbf{X }\), characterized by a joint Probability Density Function (PDF) \(f_\mathbf{X }\) with support \(I_\mathbf{X }\). The stochastic version of the model is represented as

$$\begin{aligned} \mathbf{Y } = \mathcal {M}(\mathbf{X }), \end{aligned}$$
(6.2)

which has a joint PDF \(f_\mathbf{Y }\) that is unknown before the propagation of uncertainties [23]. Note that capital letters are used to denote random objects, while lowercase letters are still used for the deterministic case. Assuming for simplicity that the QoI is a scalar, and that the random input is composed of independent and identically distributed (iid) uniform parameters \(X_i\), scaled so that \(\mathbf{X }\) has support \([0,1]^m\), the Hoeffding-Sobol decomposition is given by

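$$\begin{aligned} Y = \mathcal {M}_0 + \sum _{i=1}^{m} \mathcal {M}_i(X_i) + \sum _{1 \le i < j \le m} \mathcal {M}_{ij}(X_i,X_j) + \cdots + \mathcal {M}_{1,\ldots ,m}(X_1,\ldots ,X_m) \,, \end{aligned}$$
(6.3)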

where

$$\begin{aligned} \begin{aligned} \mathcal {M}_0&= \mathbb {E}\,[Y] \,, \\ \mathcal {M}_i(X_i)&= \mathbb {E}\,[Y \mid X_i] - \mathcal {M}_0 \,,\\ \mathcal {M}_{ij}(X_i,X_j)&= \mathbb {E}\,[Y \mid X_i,X_j] - \mathcal {M}_0 - \mathcal {M}_i - \mathcal {M}_j\,,\\ \cdots \\ \end{aligned} \end{aligned}$$
(6.4)

that is, \(\mathcal {M}_0\) is the mean value, and the terms of increasing order are conditional expectations defined recursively, which characterize a unique orthogonal decomposition of the model response [16, 21].

Following this idea, we can now decompose the total variance of the response as follows

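$$\begin{aligned} \text {Var}(Y) = \sum _{\begin{array}{c} \mathbf{u} \subset \{1, \ldots ,m\} \\ \mathbf{u} \ne \emptyset \end{array}} \text {Var}\left( \mathcal {M}_\mathbf{u }(\mathbf{X} _\mathbf{u }) \right) \,, \end{aligned}$$
(6.5)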

where \(\text {Var}\left( \mathcal {M}_\mathbf{u }(\mathbf{X} _\mathbf{u }) \right) \) expresses the partial variance associated with the subvector \(\mathbf{X} _\mathbf{u }\), containing the variables whose indices are indicated by the subset \(\mathbf{u} \) [21]. Thus, the Sobol’ index associated with the subset \(\mathbf{u} \) is defined as the ratio between the contribution to the model variance given by the interaction among the components of \(\mathbf{u} \) and the total variance itself [14], i.e.,

$$\begin{aligned} S_{\mathbf {u}} = \frac{ \text {Var}\left( \mathcal {M}_\mathbf {u}(X_\mathbf {u}) \right) }{ {\text {Var}}\left( Y \right) } . \end{aligned}$$
(6.6)

As a result of this equation we can verify that, for \(\mathbf{u} \subset \{ 1,\ldots ,m \}\), \(\mathbf{u} \ne \emptyset \),

$$\begin{aligned} \sum _{\mathbf {u}} S_{\mathbf {u}} = 1 \,, \end{aligned}$$
(6.7)

that is, by construction the sum of all the Sobol’ indices must be equal to one.

The terms

$$\begin{aligned} S_i = \frac{\text {Var}\left( \mathcal {M}_{i}(X_{i}) \right) }{\text {Var}(Y)}, \qquad i=1, \ldots , m \end{aligned}$$
(6.8)

are called the first-order Sobol’ indices for the single variable \(X_i\) and denote the individual effect of \(X_i\) on the total model variance. Similarly,

$$\begin{aligned} S_{ij} = \frac{\text {Var}\left( \mathcal {M}_{ij}(X_i,X_j) \right) }{\text {Var}(Y)}, \qquad 1 \le i<j \le m \end{aligned}$$
(6.9)

are the second-order indices, which capture the effect of the interaction between the variables \(X_i\) and \(X_j\). Proceeding in this way, we can construct the Sobol’ indices of all orders up to the mth-order index, \(S_{1,\ldots ,m}\), which represents the contribution of the interaction between all the variables in \(\mathbf{X} \) [16].

To measure the full contribution of the ith random variable \(X_i\) for the total variance either by its single effect or by its interaction with others, we use the total Sobol’ indices, which are defined by

$$\begin{aligned} S_i^T = \sum _{\begin{array}{c} \mathbf{u} \subset \{1, \ldots ,m\} \\ i \in \mathbf{u} \end{array}} S_\mathbf{u } \qquad i=1, \ldots , m. \end{aligned}$$
(6.10)
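As a worked illustration (an example added here under stated assumptions, not taken from the cited references), consider the hypothetical model \(Y = X_1 X_2\) with \(X_1\) and \(X_2\) iid uniform on [0, 1]. Applying the definitions in Eq. (6.4) gives

$$\begin{aligned} \begin{aligned} \mathcal {M}_0&= \tfrac{1}{4} \,, \\ \mathcal {M}_i(X_i)&= \tfrac{1}{2} X_i - \tfrac{1}{4} \,, \qquad i = 1, 2 \,, \\ \mathcal {M}_{12}(X_1,X_2)&= \left( X_1 - \tfrac{1}{2} \right) \left( X_2 - \tfrac{1}{2} \right) \,. \end{aligned} \end{aligned}$$

Since \(\text {Var}(X_i) = 1/12\), the partial variances are \(\text {Var}(\mathcal {M}_i) = 1/48\) and \(\text {Var}(\mathcal {M}_{12}) = 1/144\), while the total variance is \(\text {Var}(Y) = 1/9 - 1/16 = 7/144\), which is indeed the sum of the three partial variances. Therefore

$$\begin{aligned} S_1 = S_2 = \tfrac{3}{7} \,, \qquad S_{12} = \tfrac{1}{7} \,, \qquad S_1^T = S_2^T = \tfrac{4}{7} \,. \end{aligned}$$

The first-order and second-order indices sum to one, and each total index exceeds the corresponding first-order index by exactly the interaction contribution.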

4 Surrogate Models

To compute the Sobol’ indices defined in Sect. 6.3.1, it is necessary to calculate the underlying variances. Although this task can be done with Monte Carlo (MC) simulation, the associated computational cost can be high (infeasible for high-dimensional systems) and the calculation is subject to numerical instabilities such as cancelation errors. A very appealing alternative, which circumvents these two problems, is the use of surrogate models based on polynomial chaos expansions [15].
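To make the cost of the MC route concrete, the sketch below estimates first-order and total indices by sampling. The chapter does not say which MC estimator it uses; the ones here are the commonly used forms attributed to Saltelli et al. (2010) for the first-order index and Jansen (1999) for the total index, chosen as an assumption. The toy model \(Y = X_1 X_2\) with iid uniform inputs on [0, 1] is also hypothetical; its exact indices can be worked out by hand as \(S_1 = S_2 = 3/7\) and \(S_1^T = S_2^T = 4/7\).

```python
import numpy as np


def model(x):
    """Hypothetical toy model Y = X1 * X2, evaluated row-wise on an (N, 2) array."""
    return x[:, 0] * x[:, 1]


def sobol_monte_carlo(f, m, n_samples, seed=0):
    """Estimate first-order and total Sobol' indices for iid U[0, 1] inputs."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_samples, m))      # first independent sample matrix
    b = rng.random((n_samples, m))      # second independent sample matrix
    ya, yb = f(a), f(b)
    var_y = np.var(np.concatenate([ya, yb]), ddof=1)
    first = np.empty(m)
    total = np.empty(m)
    for i in range(m):
        ab = a.copy()
        ab[:, i] = b[:, i]              # matrix A with column i taken from B
        yab = f(ab)
        # First-order estimator (assumed form, Saltelli et al. 2010)
        first[i] = np.mean(yb * (yab - ya)) / var_y
        # Total-order estimator (assumed form, Jansen 1999)
        total[i] = 0.5 * np.mean((ya - yab) ** 2) / var_y
    return first, total


first, total = sobol_monte_carlo(model, m=2, n_samples=100_000)
print(first, total)
```

This design needs \(N(m+2)\) model evaluations, and the first-order estimator is built from differences of nearly equal quantities, which is where cancelation errors and slightly negative estimates can arise.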

4.1 Polynomial Chaos Expansion

The Polynomial Chaos Expansion (PCE) of the computational model response is a sum of orthogonal polynomials weighted by coefficients to be determined [22, 23, 25, 26], which reads as

$$\begin{aligned} Y = \mathcal {M}(\mathbf{X} ) = \sum _{\alpha =0}^{\infty } y_{\alpha } \Psi _{\alpha }(\mathbf{X} ) \,, \end{aligned}$$
(6.11)

where \(\Psi _{\alpha }(\mathbf{X} )\) are multivariate orthonormal polynomials, associated with the density \(f_\mathbf{X} \), and \(y_{\alpha }\) are the deterministic coefficients to be determined in order to construct the expansion [22]. For computational implementation purposes, a truncated PCE is considered

$$\begin{aligned} Y \approx \mathcal {M}^{PC}(\mathbf{X} ) = \sum _{\alpha =0}^{P} y_{\alpha } \Psi _{\alpha }(\mathbf{X} ) \,, \end{aligned}$$
(6.12)

where \(P+1\) is the number of terms in the expansion, which depends on the number of input random variables m and the maximum degree p allowed for the polynomial expansion, according to the formula \(P + 1 = (m+p)!/(m!p!)\). Note that the quality of the PCE depends directly on the number of terms in the expansion [27].
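The growth of the basis size with m and p is easy to tabulate. The short sketch below simply evaluates the formula \((m+p)!/(m!p!)\); the function name `pce_num_terms` and the choice m = 4 are illustrative assumptions.

```python
from math import comb


def pce_num_terms(m, p):
    """Number of polynomials of total degree <= p in m variables: (m + p)! / (m! p!)."""
    return comb(m + p, p)


# Basis size for a hypothetical input of dimension m = 4 and increasing degree.
for p in (2, 4, 6):
    print(p, pce_num_terms(4, p))
```

Since the regression needs at least as many model evaluations as unknown coefficients, this count gives a lower bound on the size of the experimental design.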

The family of orthonormal polynomials to be used is chosen according to the model input distribution, in a sense that seeks to minimize the number of terms needed in the expansion to build a good computational representation of the model. Table 6.1 summarizes a list of the most classical polynomial families and underlying distributions. For further details about the construction of the polynomial basis see [22, 23, 27].

Table 6.1 Correspondence between the random variable distribution and the optimal family of orthonormal polynomials [22]

4.2 Calculation of the Coefficients

Several strategies can be adopted to calculate the PCE coefficients. In this section, we describe the basics of the calculation procedure based on a regression, employing the Ordinary Least-Squares (OLS) method because of its simplicity and generality [24]. The reader is encouraged to see further details about other methods for PCE coefficients calculation in [22, 27].

Once the truncated PCE of Eq. (6.12) is used for computer simulations, there is an error that separates it from the “complete” PCE given by Eq. (6.11), which can be rewritten as follows:

$$\begin{aligned} Y = \mathcal {M}(\mathbf{X} ) = \sum _{\alpha =0}^{P} y_{\alpha } \Psi _{\alpha }(\mathbf{X} ) + \varepsilon _P = \mathbf{y} ^T \varvec{\Psi }(\mathbf{X} ) + \varepsilon _P \,, \end{aligned}$$
(6.13)

where \(\varepsilon _P\) is the truncation error, \(\mathbf{y} = \left( y_0, \ldots , y_{P} \right) ^T\) is the vector of coefficients and \(\varvec{\Psi }(\mathbf{x} ) = \{\Psi _0(\mathbf{x} ), \ldots , \Psi _{P}(\mathbf{x} )\}\) is the vector that assembles the values of all the orthonormal polynomials at \(\mathbf{x} \) [27]. The coefficients of the truncated PCE are therefore calculated so as to minimize the truncation error [23]. To this end, we draw via the MC method a “small” set of \(N_s\) samples of the random input \(\mathbf{X} \), called the experimental design

$$\begin{aligned} \varvec{\chi }= \left\{ \mathbf{x} ^{(1)}, \mathbf{x} ^{(2)}, \ldots , \mathbf{x} ^{(N_s)} \right\} \end{aligned}$$
(6.14)

and compute the model response for those samples

$$\begin{aligned} y^{\left( 1\right) }&= \mathcal {M}(\mathbf{x} ^{\left( 1\right) }) \,, \end{aligned}$$
(6.15)
$$\begin{aligned} y^{\left( 2\right) }&= \mathcal {M}(\mathbf{x} ^{\left( 2\right) }) \,, \end{aligned}$$
(6.16)
$$\begin{aligned}&\vdots \end{aligned}$$
(6.17)
$$\begin{aligned} y^{\left( N_s\right) }&= \mathcal {M}(\mathbf{x} ^{\left( N_s\right) }) \,. \end{aligned}$$
(6.18)

The key problem is to calculate the PCE coefficients that make the PCE best fit the responses obtained from the computational model. This is the classic least-squares regression problem

$$\begin{aligned} \mathbf{y} ^T\Psi (\mathbf{x} ) \,\, \approx \,\, \mathcal {M}(\mathbf{x} ), \end{aligned}$$
(6.19)

for which the general solution may be expressed as

$$\begin{aligned} \mathbf{y} ^{*} = \underset{\mathbf{y }}{\text {argmin}} \, \mathbb {E} \left[ \left( \mathbf{y} ^T\Psi (\mathbf{x} ) - \mathcal {M}(\mathbf{x} ) \right) ^2 \right] \,, \end{aligned}$$
(6.20)

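where the values of \(\mathbf{x} \) come from the experimental design, so that the expectation is approximated by the sample mean over it. Note that MC sampling can be costly, because each sample requires one run of the full model, so the idea is to choose \({N_s}\) large enough to ensure accuracy but small enough to keep the computational cost low.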
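A minimal sketch of the regression step is given below, assuming iid uniform inputs on [0, 1] (hence a Legendre basis rescaled to be orthonormal for that density), a total-degree truncation, and the hypothetical model \(Y = X_1 X_2\). All function names are illustrative; this is not the UQLab implementation.

```python
import itertools
import numpy as np
from numpy.polynomial import legendre


def multi_indices(m, p):
    """All multi-indices of length m with total degree <= p."""
    return [a for a in itertools.product(range(p + 1), repeat=m) if sum(a) <= p]


def legendre_orthonormal(k, x):
    """Degree-k Legendre polynomial, orthonormal for the U[0, 1] density."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return np.sqrt(2 * k + 1) * legendre.legval(2.0 * x - 1.0, coeffs)


def design_matrix(x, alphas):
    """Matrix whose (i, j) entry is the j-th basis polynomial at the i-th sample."""
    cols = [np.prod([legendre_orthonormal(k, x[:, d]) for d, k in enumerate(a)], axis=0)
            for a in alphas]
    return np.column_stack(cols)


def fit_pce_ols(f, m, p, n_samples, seed=0):
    """Fit PCE coefficients by ordinary least squares on a random design."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, m))          # experimental design
    y = f(x)                                # model evaluations
    alphas = multi_indices(m, p)
    psi = design_matrix(x, alphas)
    coeffs, *_ = np.linalg.lstsq(psi, y, rcond=None)
    return alphas, coeffs


alphas, coeffs = fit_pce_ols(lambda x: x[:, 0] * x[:, 1], m=2, p=2, n_samples=50)
for a, c in zip(alphas, coeffs):
    print(a, round(float(c), 6))
```

Because the toy model is itself a polynomial of total degree 2, the fit recovers its expansion to rounding error: the constant coefficient is 1/4, the two linear coefficients are \(1/(4\sqrt{3})\), the mixed coefficient is 1/12, and the rest vanish.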

4.3 Surrogate Error Estimation

The construction of a good surrogate requires a rigorous process of validation of the response surface obtained. Using a good error metric is essential to characterize a good approximation [15].

Considering the previous section, where PCE coefficients are computed via OLS method, it is natural to evaluate the approximation error by the coefficient of determination [10]. In this case, we can calculate this quantity from the experimental design used in the regression as follows

$$\begin{aligned} R^2 = 1 - \displaystyle \frac{\frac{1}{Ns} \sum _{i=1}^{Ns}\left( \mathcal {M}(\mathbf{x} ^{\left( i\right) }) - \mathcal {M}^{PC}(\mathbf{x} ^{\left( i\right) }) \right) ^2}{\hat{V}(Y)}, \end{aligned}$$
(6.21)

where \(\hat{V}(Y)\) is the empirical variance of the model evaluations, given by

$$\begin{aligned} \hat{V}(Y) = \frac{1}{Ns-1}\sum _{i=1}^{Ns} \left( \mathcal {M}(\mathbf{x} ^{\left( i\right) }) - \bar{y} \right) ^2 \,, \end{aligned}$$
(6.22)

with

$$\begin{aligned} \bar{y} = \frac{1}{Ns}\sum _{i=1}^{Ns} \mathcal {M}(\mathbf{x} ^{\left( i\right) }) \,, \end{aligned}$$
(6.23)

and \(\mathbf{x} ^{\left( i\right) }\) is the ith evaluation of the random input X. Another option to measure the error is to apply the normalized empirical error [26]. Nevertheless, this measure can be problematic in cases of over-fitting. Thus, if you are not sure about the size of your experimental design, it is recommended to work with the Leave-One-Out (LOO) cross-validation error [27, 28], calculated by

$$\begin{aligned} \epsilon _{\textit{LOO}} = \frac{\displaystyle \sum _{i=1}^{Ns}\left( \mathcal {M}(\mathbf{x} ^{\left( i\right) }) - \mathcal {M}^{PC\backslash i}(\mathbf{x} ^{\left( i\right) }) \right) ^2}{\displaystyle \sum _{i=1}^{Ns}\left( \mathcal {M}(\mathbf{x} ^{\left( i\right) }) - \bar{y} \right) ^2} \,, \end{aligned}$$
(6.24)

where \(\mathcal {M}^{PC\backslash i}\) notation indicates the ith metamodel built using the reduced experimental design \(\varvec{\chi }\backslash \mathbf{x} ^{\left( i\right) } = \{ \mathbf{x} ^{\left( j\right) }, j=1, \ldots , {N_s}, j \ne i \}\).
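Taken literally, Eq. (6.24) requires \(N_s\) separate regressions. For an OLS fit, however, each leave-one-out residual equals the ordinary residual divided by \(1 - h_i\), where \(h_i\) is the ith diagonal entry of the hat matrix, so a single fit suffices. The sketch below checks this identity against the brute-force computation. The scalar model \(e^x\) and the plain monomial basis are assumptions made for brevity; an orthonormal basis spanning the same polynomial space would give the same error.

```python
import numpy as np


def loo_error_closed_form(psi, y):
    """Relative leave-one-out error of an OLS fit, using the hat-matrix diagonal."""
    coeffs, *_ = np.linalg.lstsq(psi, y, rcond=None)
    residuals = y - psi @ coeffs
    # h_i: diagonal of psi (psi^T psi)^{-1} psi^T, computed via the pseudo-inverse
    h = np.einsum("ij,ji->i", psi, np.linalg.pinv(psi))
    return np.sum((residuals / (1.0 - h)) ** 2) / np.sum((y - y.mean()) ** 2)


def loo_error_brute_force(psi, y):
    """Same quantity obtained by refitting the surrogate once per left-out sample."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coeffs, *_ = np.linalg.lstsq(psi[keep], y[keep], rcond=None)
        errors[i] = y[i] - psi[i] @ coeffs
    return np.sum(errors ** 2) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(1)
x = rng.random(40)                                    # experimental design on [0, 1]
y = np.exp(x)                                         # hypothetical scalar model
psi = np.vander(2.0 * x - 1.0, 4, increasing=True)    # cubic polynomial basis
loo_fast = loo_error_closed_form(psi, y)
loo_slow = loo_error_brute_force(psi, y)
print(loo_fast, loo_slow)
```

The two values agree to rounding error, while the closed form avoids rebuilding the metamodel for every sample.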

4.4 PCE-Based Sobol’ Indices

Note that, due to the orthonormality of the PCE basis, the model variances (partial and total) can be calculated using only the expansion coefficients. The sum of the squares of all the PCE coefficients except the constant term \(y_0\) provides the variance of the model’s response, and summing only the squared coefficients associated with a given subset of indices gives the corresponding partial variance [21, 29]. Therefore, an estimator for the Sobol’ index associated with the subset \(\mathbf {u}\) is given by

$$\begin{aligned} S_{\mathbf {u}} = \frac{ \displaystyle \sum _{\alpha \in \mathbf {u}} y_{\alpha }^2 }{ \displaystyle \sum _{\alpha =1}^{P} y_{\alpha }^2 } \,, \end{aligned}$$
(6.25)

where \(\mathbf {u}\) is a suitable subset of indices.

We observe that once the surrogate has been calculated, these indices can be obtained at negligible computational cost, since in general a single evaluation of the model is much more expensive than the sums involved in the fraction of Eq. (6.25). In addition to the computational cost issue, the use of a surrogate PCE to calculate Sobol’ indices also eliminates the possibility of numerical cancelation errors in the variance calculation, since only sums of positive quantities are involved in the above algorithm [15].
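The post-processing in Eq. (6.25) can be sketched in a few lines. The function below assumes that each coefficient is stored together with its multi-index (the polynomial degree in each input variable), which is one common bookkeeping convention and not necessarily the one used by UQLab. The coefficients in the example are the exact orthonormal Legendre coefficients of the hypothetical model \(Y = X_1 X_2\) with iid uniform inputs on [0, 1].

```python
import numpy as np


def sobol_from_pce(alphas, coeffs):
    """First-order and total Sobol' indices from orthonormal PCE coefficients."""
    alphas = np.asarray(alphas)
    coeffs = np.asarray(coeffs, dtype=float)
    nonconstant = alphas.sum(axis=1) > 0
    variance = np.sum(coeffs[nonconstant] ** 2)      # the mean y_0 does not contribute
    m = alphas.shape[1]
    first = np.empty(m)
    total = np.empty(m)
    for i in range(m):
        others = np.delete(alphas, i, axis=1)
        only_i = (alphas[:, i] > 0) & (others.sum(axis=1) == 0)   # terms in X_i alone
        any_i = alphas[:, i] > 0                                  # every term involving X_i
        first[i] = np.sum(coeffs[only_i] ** 2) / variance
        total[i] = np.sum(coeffs[any_i] ** 2) / variance
    return first, total


# Exact coefficients of the hypothetical model Y = X1 * X2
alphas = [(0, 0), (1, 0), (0, 1), (1, 1)]
coeffs = [0.25, 1 / (4 * np.sqrt(3)), 1 / (4 * np.sqrt(3)), 1 / 12]
first, total = sobol_from_pce(alphas, coeffs)
print(first, total)
```

For this example the first-order indices come out as 3/7 and the total indices as 4/7. Only squares are added, so no index can become negative.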

Fig. 6.2 Schematic representation of the process of obtaining the Sobol’ indices by using a surrogate model

5 A Practical Tutorial

5.1 Tutorial Description

With the theoretical background presented, we now move to practice, which is, after all, the goal of a tutorial. The computational methodology to compute Sobol’ indices is composed of three steps: (i) characterization of the random input; (ii) construction of the PCE; and (iii) calculation of the Sobol’ indices. The framework of this process is illustrated in Fig. 6.2.

Here the UQLab library [30] is used to facilitate the implementation of a code to compute the Sobol’ indices. We encourage the reader to look for more details in the UQLab manual [30]. To describe the random input, it is necessary to define the distribution of each random parameter as well as the respective hyperparameters and support. In the second step, you have to define a computational model to simulate your computational experiment, the maximum degree of the polynomial expansion, and the number of samples that will be used to generate the experimental design for the regression. The PCE coefficients are calculated for each observation time defined, and you then have everything you need to obtain the Sobol’ indices up to the desired order. Note that the maximum order of the Sobol’ indices cannot be higher than the dimension of the random input vector. Further details about this construction can be found in the UQLab manual for Global Sensitivity Analysis [29].

5.2 SoBioS: Sobol’ Indices for Biological Systems

To facilitate the manipulation of the UQLab packages, the authors developed a computational library called SoBioS—Sobol’ indices for Biological Systems—focused on the simulation of the Sobol’ indices for biological systems with the UQLab tool. This computational library is available in the following repository:

https://americocunhajr.github.io/SoBioS

The auxiliary routines used to create the results of the following examples can also be found there. In summary, two basic routines are needed: a main file, which defines the input details, calls the computational model, performs the sensitivity analysis, and plots the results; and a QoI file, which defines the quantity of interest.

5.3 Example 1: Predator–Prey Dynamics

The first example is the classical Lotka–Volterra model. Better known as the predator–prey model, it is a dynamical system used to reproduce a simple predator–prey relationship [18]. It is assumed that the prey are capable of reproducing spontaneously and that the predators feed only on this prey. The standard model considers the reproduction of prey and predators, the natural death of the predators (independent of the prey), and the death of prey by hunting. So, the relation between the two species is beneficial for the predator and harmful for the prey.

By assuming a simple representation of these mechanisms, we can construct the following pair of equations:

$$\begin{aligned} \frac{\mathrm{d}V}{\mathrm{d}t}&= a V - b VP \,, \end{aligned}$$
(6.26)
$$\begin{aligned} \frac{\mathrm{d}P}{\mathrm{d}t}&= d VP - c P \,, \end{aligned}$$
(6.27)

where V and P represent the populations of prey and predator, respectively, and time is measured in years. In the first equation, the hunting term is subtracted from the birth term. For the predator, it is assumed that the population grows at a per-capita rate given by the energetic efficiency d times the number of prey and decays at the constant per-capita rate c. The model parameters as well as the other simulation parameters are described in Table 6.2. The intervals of the model parameters and the initial conditions were extracted from the UQLab Bayesian Calibration manual [31], and the other simulation values were chosen by the authors. Numerical integration was performed using the Dormand–Prince adaptive Runge–Kutta method, implemented in ode45, and the response is restricted to three time instants to simplify the visualization of the results. The QoI can be the prey or the predator population, depending on the situation.
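A minimal sketch of the forward model is shown below. To stay dependency-free it uses the classical fixed-step fourth-order Runge–Kutta scheme rather than the adaptive Dormand–Prince method mentioned above, and the parameter values and initial conditions are placeholders chosen for illustration, not the values of Table 6.2. The function names are likewise hypothetical.

```python
import math


def lotka_volterra(state, a, b, c, d):
    """Right-hand side of the predator-prey equations for state = (V, P)."""
    v, p = state
    return (a * v - b * v * p, d * v * p - c * p)


def rk4(rhs, state, t_end, dt, **params):
    """Integrate from t = 0 to t_end with the fixed-step classical Runge-Kutta method."""
    steps = round(t_end / dt)
    for _ in range(steps):
        k1 = rhs(state, **params)
        k2 = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), **params)
        k3 = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), **params)
        k4 = rhs(tuple(s + dt * k for s, k in zip(state, k3)), **params)
        state = tuple(s + dt / 6.0 * (q1 + 2 * q2 + 2 * q3 + q4)
                      for s, q1, q2, q3, q4 in zip(state, k1, k2, k3, k4))
    return state


def invariant(state, a, b, c, d):
    """Quantity conserved along exact Lotka-Volterra trajectories."""
    v, p = state
    return d * v - c * math.log(v) + b * p - a * math.log(p)


# Placeholder values only; the chapter's values are those of Table 6.2.
params = dict(a=1.0, b=0.1, c=1.5, d=0.075)
initial = (10.0, 5.0)

# QoIs: prey and predator populations at three observation times (in years).
qoi = {year: rk4(lotka_volterra, initial, year, 1e-3, **params) for year in (1, 6, 10)}
print(qoi)
```

Exact trajectories conserve the quantity \(dV - c\ln V + bP - a\ln P\), which gives a cheap check that the step size is small enough before the model is handed to a sensitivity analysis.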

Table 6.2 Parameters description and values used to simulate the Sobol’ indices for the predator–prey model

In this first example, we compare the results of sensitivity analysis by the MC and surrogate approaches. Before that, it is necessary to determine how many samples to use for each of these methods. Two different strategies are considered in order to evaluate the reliability of the computed Sobol’ indices. For the MC method, the idea is to run hierarchical simulations with an increasing number of samples, using only one value of each QoI, and to calculate 95% Confidence Intervals (CI) by the Bootstrap method [23]. For PCE-Sobol’, the approach is to plot the validation graphs and compare the surrogate response with that of the full model, which also allows estimating the ideal maximum polynomial degree. Some results for each strategy can be observed in Fig. 6.3 (MC) and in Figs. 6.4 and 6.5 (PCE). For MC, convergence with an increasing number of samples is visible because the CI width decreases. In the validation plots of Figs. 6.4 and 6.5 we can observe the good match between the surrogate results and the response of the original computational model, in addition to the low order of magnitude of the \(\epsilon _{LOO}\) cross-validation error. This comparison is done using 10000 samples. Based on these results, we adopt 75000 samples for the MC simulations and 1000 for the PCE simulations. Additionally, the maximum degree adopted for the surrogate approximation was 6.

Fig. 6.3 Comparison between MC-Sobol’ total-order results using 25000 (left), 50000 (middle), and 75000 (right) samples, with 95% confidence intervals (red) calculated with the Bootstrap method, considering both predator (yellow) and prey (blue) populations in the 10th year

Fig. 6.4 Comparison between the PCE surrogate response and the computational model response (top) for predators in years 1 (left), 6 (middle), and 10 (right), associated mismatch error (bottom), and leave-one-out error (top legend), using 1000 samples as experimental design and 10000 as validation set

Fig. 6.5 Comparison between the PCE surrogate response and the computational model response (top) for prey in years 1 (left), 6 (middle), and 10 (right), associated mismatch error (bottom), and leave-one-out error (top legend), using 1000 samples as experimental design and 10000 as validation set

Finally, Figs. 6.6 and 6.7 present the comparison of the MC and PCE results for sensitivity analysis. In these graphs, we can observe that the results are very close. However, the total-order results for MC are slightly smaller. This is due to the negative second-order indices obtained with MC, which suffers from cancelation errors, something to which the calculation via PCE is immune since it does not involve subtractions. The negative values have been omitted from the associated figures to avoid any confusion, since negative Sobol’ indices do not make sense.

Fig. 6.6 Comparison between MC-Sobol’ results (left) and PCE-Sobol’ results (right) for the prey population of the Lotka–Volterra model. The simulation was performed with 75000 samples for the MC and 1000 for the PCE

Fig. 6.7 Comparison between MC-Sobol’ results (left) and PCE-Sobol’ results (right) for the prey population of the Lotka–Volterra model. The simulation was performed with 75000 samples for the MC and 1000 for the PCE

Note that for this simple scenario the cancelation error is not that problematic, but in a situation involving high-order indices the error propagation can affect the conclusions drawn from the results. The second advantage is, of course, the simulation time. Even for this simple analysis, the difference is notable. Using an Intel Core i5-8250U 1.6 GHz \(\times \) 8 (8 GB of RAM), the MC method needed 2 min, while the PCE ran in 45 s, even though the latter plots more results because of the validation graphs.

Beyond the discussion about methods, the Sobol’ results reveal that, for these parametric intervals and initial conditions, the parameter d is the most important in the first year, b in the sixth, and a in the final one when analyzing the predator population. For the prey the results are completely different. The parameter a controls most of the variance in the first year, but the parameter c takes control after that.

The great difference between the scenarios for each population may sound strange, as may the change of the most important parameter over time. In fact, this is not only possible but common. For time-dependent systems, this detail is essential for control measures: some adaptive strategy will be necessary to act on different parameters of the phenomenon at each step of implementing the measure. Also, note that in this case the second-order interactions were not very important for the total-order indices of the prey, but they significantly increase the total indices for the predators.

5.4 Example 2: NF-\(\kappa \)B Signaling Pathway

NF-\(\kappa \)B is a family of Transcription Factors (TF) that takes part in the regulation of several mammalian cellular processes, such as cell division, apoptosis, inflammation, immune response, and cancer [32]. The family consists of five subunits (p50, p52, p65, c-Rel, and RelB), which associate to form functional dimers. The NF-\(\kappa \)B complex is inactive in the cytoplasm when bound to the I\(\kappa \)B inhibitor protein, and it is released after the phosphorylation of the I\(\kappa \)B protein by the I\(\kappa \)B kinase (IKK) complex. This dissociation of the NF-\(\kappa \)B:I\(\kappa \)B complex also promotes the ubiquitination of I\(\kappa \)B and then its degradation. At this moment, NF-\(\kappa \)B is free to be translocated into the nucleus, where it activates gene transcription by binding to specific DNA \(\kappa \)B sites. Among the activated genes are those for I\(\kappa \)B, that is, I\(\kappa \)B mRNA is transcribed and translated into new I\(\kappa \)B proteins. Part of the new I\(\kappa \)B binds to NF-\(\kappa \)B in the cytoplasm, and part enters the nucleus, where it binds to nuclear NF-\(\kappa \)B, forming an NF-\(\kappa \)B:I\(\kappa \)B complex that is exported to the cytoplasm. In both cases, the complex is again the target of the I\(\kappa \)B kinase (IKK), characterizing a negative feedback mechanism.

A mathematical model for the NF-\(\kappa \)B pathway was introduced in 2002 by Hoffmann et al. [5]. Their model consists of 26 molecular species (variables) and 64 reaction coefficients (parameters). Besides setting up the system of differential equations for such a complex biological system, they studied the influence of the negative feedback on the sustained and damped oscillations of the NF-\(\kappa \)B concentration over time. An interesting follow-up from the same group was presented in [6]. As the reference list makes clear, their mathematical model has been influential, and the NF-\(\kappa \)B pathway model is a successful systems biology example of the interplay between mathematical modeling and experimental analysis.

Taking the two-compartment model of [5] as a starting point, removing species that receive no feedback from NF-\(\kappa \)B, and discarding slow reactions in favor of faster ones, Krishna et al. [33] extracted the core feedback loop of the model, arriving at a reduced model with 7 species (variables) and 12 reaction coefficients (parameters). The variables of the reduced model are: \(N_n\) and N, the free nuclear and free cytoplasmic NF-\(\kappa \)B concentrations, respectively; \(I_n\) and I, the free nuclear and free cytoplasmic I\(\kappa \)B concentrations, respectively; \(I_m\), the I\(\kappa \)B mRNA concentration; and \(\{NI\}_n\) and \(\{NI\}\), the nuclear and cytoplasmic NF-\(\kappa \)B:I\(\kappa \)B complex concentrations, respectively. The I\(\kappa \)B kinase concentration, IKK, enters the model as an external input rather than as a dynamic variable. The dynamics of the seven species are depicted in Fig. 6.8, resulting in the system of equations

$$\begin{aligned} \begin{aligned} \dfrac{\mathrm{d}N_n}{\mathrm{d}t}&= k_{N_\mathrm{in}}N - k_{f_n} N_n I_n + k_{b_n} \{NI\}_n ~, \\ \dfrac{\mathrm{d}N}{\mathrm{d}t}&= -k_{N_\mathrm{in}}N - k_{f} N I + (k_{b}+\alpha ) \{NI\} ~, \\ \dfrac{\mathrm{d}I_n}{\mathrm{d}t}&= k_{I_\mathrm{in}}I - k_{I_\mathrm{out}}I_n - k_{f_n} N_n I_n + k_{b_n} \{NI\}_n ~, \\ \dfrac{\mathrm{d}I}{\mathrm{d}t}&= -k_{I_\mathrm{in}}I + k_{I_\mathrm{out}}I_n - k_{f} N I + k_{b} \{NI\} + k_{tl} I_m ~, \\ \dfrac{\mathrm{d}\{NI\}}{\mathrm{d}t}&= k_{NI_\mathrm{out}} \{NI\}_n + k_{f} N I - (k_{b}+\alpha ) \{NI\} ~, \\ \dfrac{\mathrm{d}\{NI\}_n}{\mathrm{d}t}&= -k_{NI_\mathrm{out}} \{NI\}_n + k_{f_n} N_n I_n - k_{b_n} \{NI\}_n ~, \\ \dfrac{\mathrm{d}I_m}{\mathrm{d}t}&= k_t N_n^2 - k_m I_m ~. \end{aligned} \end{aligned}$$
(6.28)
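As a reading aid, the right-hand side of (6.28) can be transcribed term by term. The following is a minimal sketch in Python (not the MATLAB code that accompanies this chapter); the function name `nfkb_rhs`, the dictionary keys, and the rate constants in `params` are our own illustrative assumptions and are not the nominal values of Table 6.3. The state is ordered as \((N_n, N, I_n, I, \{NI\}, \{NI\}_n, I_m)\).

```python
def nfkb_rhs(state, p):
    """Right-hand side of system (6.28).

    state = (Nn, N, In, I, NI, NIn, Im); p maps parameter names to values.
    """
    Nn, N, In, I, NI, NIn, Im = state
    bind_cyt = p["kf"] * N * I      # cytoplasmic association N + I -> {NI}
    bind_nuc = p["kfn"] * Nn * In   # nuclear association Nn + In -> {NI}n
    dNn = p["kNin"] * N - bind_nuc + p["kbn"] * NIn
    dN = -p["kNin"] * N - bind_cyt + (p["kb"] + p["alpha"]) * NI
    dIn = p["kIin"] * I - p["kIout"] * In - bind_nuc + p["kbn"] * NIn
    dI = (-p["kIin"] * I + p["kIout"] * In - bind_cyt
          + p["kb"] * NI + p["ktl"] * Im)
    dNI = p["kNIout"] * NIn + bind_cyt - (p["kb"] + p["alpha"]) * NI
    dNIn = -p["kNIout"] * NIn + bind_nuc - p["kbn"] * NIn
    dIm = p["kt"] * Nn ** 2 - p["km"] * Im
    return (dNn, dN, dIn, dI, dNI, dNIn, dIm)


# Placeholder rate constants for illustration only (NOT the values of Table 6.3).
params = {
    "kNin": 5.0, "kIin": 0.02, "kIout": 0.01, "kNIout": 0.8,
    "kt": 1.0, "ktl": 0.2, "kf": 30.0, "kfn": 30.0,
    "kb": 0.03, "kbn": 0.03, "alpha": 0.7, "km": 0.02,
}

# Initial state with all NF-kB free in the cytoplasm.
state0 = (0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)
rates0 = nfkb_rhs(state0, params)

# Free and bound NF-kB only move between pools, so these four rates cancel.
total_nfkb_rate = rates0[0] + rates0[1] + rates0[4] + rates0[5]
```

A useful sanity check on any implementation of (6.28) is that the total NF-\(\kappa \)B, \(N_n + N + \{NI\} + \{NI\}_n\), is conserved: the four corresponding rates sum to zero for every state and every choice of parameters.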

The descriptions and nominal values of the parameters are shown in Table 6.3, according to [33]. In the model, the NF-\(\kappa \)B pathway is activated by an extracellular stimulus, represented by an I\(\kappa \)B kinase (IKK) input that is incorporated in the parameter \(\alpha \) (top of Fig. 6.8).

Table 6.3 Parameter descriptions and nominal values of the seven-species NF-\(\kappa \)B model [33]

Fig. 6.8 Diagram representing the dynamics of the seven-species NF-\(\kappa \)B model [33]. The parameters are described in Table 6.3. The bold boxes mark the species present in the reduced three-species model

Before proceeding, we note that by I\(\kappa \)B concentration we mean the I\(\kappa \)B\(\alpha \) concentration since, as elucidated in [5], it is the I\(\kappa \)B\(\alpha \) isoform of the I\(\kappa \)B protein that takes part in the negative feedback loop of the NF-\(\kappa \)B pathway and generates sustained oscillations of the NF-\(\kappa \)B concentration. The solution of the system for the nominal values in Table 6.3 is shown in Fig. 6.9, with initial conditions \((N_{n}, N, I_{n}, I, \{NI\}, \{NI\}_{n}, I_{m})_0 = (0,1,0,0,0,0,0)\) and \(IKK = 0.7\). It was obtained by numerically integrating the system (6.28) with the MATLAB solver for stiff equations, ode15s, and it exhibits the spiky oscillation of the NF-\(\kappa \)B concentration in time. With further considerations on the NF-\(\kappa \)B:I\(\kappa \)B complex association and dissociation reactions, Krishna et al. [33] reduced the model even further, arriving at a model with 3 species, \(N_n, I, I_m\) (the bold boxes in Fig. 6.8), and 5 parameters (which are no longer plain reaction coefficients as before, but combinations of them).

Fig. 6.9 Numerically integrated solution of the system (6.28), showing the spiky oscillation of the NF-\(\kappa \)B concentration in time. The constant dashed (red) line is \(N_n = \text {const.} = mean(N_n)\)

As pointed out in [33], the spiky oscillations are robust with respect to variations of the IKK input, and the genes regulated downstream by NF-\(\kappa \)B have different response times related to this oscillation. For this reason, the QoI for our sensitivity analysis is the peak duration, defined in [33] as the time the NF-\(\kappa \)B concentration spends above its mean. For the solution in Fig. 6.9, the mean is represented by a constant dashed (red) line in the last (bottom) subplot. To build the PCE surrogate model, 400 experimental design samples were taken, with a maximum polynomial degree of 12. The probabilistic input model consists of independent uniform random variables, one for each parameter, with lower and upper bounds given by a dispersion of \(1.5\%\) around the mean, which is taken as the nominal value in Table 6.3. The quality of the surrogate approximation can be assessed in Fig. 6.10 and is reasonable for our purpose.

According to the total-order indices, displayed in Fig. 6.11, the five most influential parameters are, in decreasing order, \(k_t, k_{tl}, k_{I_\mathrm{in}}, k_m, k_{N_\mathrm{in}}\). Interestingly, these are the parameters that enter the further reduced three-species model, because they appear in the equations of (6.28) for the species \(N_n, I, I_m\). The transcription and translation parameters of the I\(\kappa \)B mRNA, \(k_t\) and \(k_{tl}\), are very influential, as can be seen from the first-order indices in Fig. 6.12. Its degradation rate \(k_m\) does not have such a first-order influence, but it shows significant coupling with five other parameters at the second-order level (Fig. 6.13). In fact, the three highest second-order indices, namely \(k_{N_\mathrm{in}}-k_m\), \(k_t-k_m\), and \(k_{I_\mathrm{in}}-k_m\), all involve \(k_m\). Another parameter that has no first-order influence but is coupled with five other parameters at the second-order level is \(k_{N_\mathrm{in}}\). That is why both \(k_m\) and \(k_{N_\mathrm{in}}\) show up in the total-order indices but not in the first-order ones. Finally, note that Fig. 6.13 shows only the second-order indices above the threshold value \(10^{-4}\).
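The peak duration QoI described above, the time the concentration spends above its mean, can be evaluated from a sampled trajectory. The sketch below is a minimal Python illustration under our own assumptions: the function name `time_above_mean` is hypothetical, the mean is taken as the time average over the whole window (trapezoidal rule), and crossings of the mean are located by linear interpolation. It returns the total time above the mean over the window; a per-spike duration would require dividing by the number of spikes, which is one possible reading of the definition in [33].

```python
import math


def time_above_mean(t, y):
    """Total time the sampled signal y(t) spends above its time-averaged mean.

    t must be strictly increasing. Returns (time_above, mean).
    """
    span = t[-1] - t[0]
    # Time average by the trapezoidal rule.
    mean = sum(0.5 * (y[i] + y[i + 1]) * (t[i + 1] - t[i])
               for i in range(len(t) - 1)) / span
    above = 0.0
    for i in range(len(t) - 1):
        a, b = y[i] - mean, y[i + 1] - mean
        dt = t[i + 1] - t[i]
        if a > 0 and b > 0:        # whole interval above the mean
            above += dt
        elif a > 0 >= b:           # signal drops below the mean inside the interval
            above += dt * a / (a - b)
        elif b > 0 >= a:           # signal rises above the mean inside the interval
            above += dt * b / (b - a)
    return above, mean


# Demonstration on a synthetic signal: two full periods of a sine wave,
# which spends half of the window above its (zero) mean.
n = 4000
t = [4.0 * math.pi * i / n for i in range(n + 1)]
y = [math.sin(ti) for ti in t]
duration, mean = time_above_mean(t, y)
```

For a spiky signal the same function returns a value well below half of the window, which is what makes this quantity a natural summary of spikiness.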

Fig. 6.10 Comparison between the PCE surrogate response and the original computational model response for the peak duration QoI

Fig. 6.11 Total order Sobol’ indices for the peak duration QoI

Fig. 6.12 First-order Sobol’ indices for the peak duration QoI

Fig. 6.13 Second-order Sobol’ indices for the peak duration QoI

5.5 Example 3: The SIR Model

The famous SIR model is a classical tool for studying epidemics [1]. Recently, several research groups have used variations of the SIR model in research on the new coronavirus pandemic [9]. Of course, as will be shown below, this model is not ideal for reproducing the coronavirus epidemic accurately; however, it can be useful to test hypotheses and analyze scenarios. The model divides the host population of a disease into three compartments: the Susceptible, S, are the individuals who can become infected upon contact with the pathogen; the Infected, I, are those who carry the pathogen and can infect the susceptible; and the Removed, R, gathers the individuals who have recovered from the disease (gaining immunity) or died from it [1]. The flow between the compartments is represented in Fig. 6.14 and is mathematically formulated by the following set of equations:

$$\begin{aligned} \frac{\mathrm{d}S}{\mathrm{d}t}&= -\frac{bSI}{N} \,, \end{aligned}$$
(6.29)
$$\begin{aligned} \frac{\mathrm{d}I}{\mathrm{d}t}&= \frac{bSI}{N} - aI \,, \end{aligned}$$
(6.30)
$$\begin{aligned} \frac{\mathrm{d}R}{\mathrm{d}t}&= aI \,, \end{aligned}$$
(6.31)

where bSI/N is the infection rate term and aI is the removal rate term. The QoI in this model is the infected population I(t). The descriptions and values of the simulation parameters are gathered in Table 6.4. As in the previous example, we have references for the mean values of the random variables, and the dispersion factor around the nominal values is 0.4 for each one. The parameter a can be interpreted as the inverse of the infectious period, for which we took the mean value of 6.5 days estimated by the Imperial College group (in their 9th report) [9]. The same report also estimates the basic reproduction number \(\mathcal {R}_0 = b/a\) for several countries. Using this expression of \(\mathcal {R}_{0}\) for the SIR model [34] and \(a = 1\), we estimate a mean for the transmission rate.
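To make Eqs. (6.29)-(6.31) and the relation \(\mathcal {R}_0 = b/a\) concrete, the following minimal Python sketch integrates the SIR system with a fixed-step fourth-order Runge-Kutta scheme. The function names, the population size, the step size, and the numerical values of `a` and `r0` are illustrative assumptions of ours; they are not the values of Table 6.4, and the scheme is not necessarily the one used to produce the results of this chapter.

```python
def sir_rhs(state, b, a):
    """Right-hand side of (6.29)-(6.31); state = (S, I, R)."""
    S, I, R = state
    N = S + I + R                       # constant host population
    new_infections = b * S * I / N
    return (-new_infections, new_infections - a * I, a * I)


def simulate_sir(b, a, S0, I0, days, dt=0.1):
    """Fixed-step RK4 integration; returns the list of states, one per step."""
    state = (S0, I0, 0.0)               # assumes R(0) = 0
    history = [state]
    for _ in range(round(days / dt)):
        k1 = sir_rhs(state, b, a)
        k2 = sir_rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), b, a)
        k3 = sir_rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), b, a)
        k4 = sir_rhs(tuple(s + dt * k for s, k in zip(state, k3)), b, a)
        state = tuple(s + dt / 6.0 * (p + 2 * q + 2 * r + w)
                      for s, p, q, r, w in zip(state, k1, k2, k3, k4))
        history.append(state)
    return history


# Illustrative placeholders, not the values of Table 6.4.
a = 1.0 / 6.5     # removal rate [1/day], read as the inverse of a 6.5-day period
r0 = 2.5          # hypothetical basic reproduction number
b = r0 * a        # transmission rate implied by R0 = b / a

history = simulate_sir(b, a, S0=999.0, I0=1.0, days=120)
peak_infected = max(I for _, I, _ in history)
```

Since the three right-hand sides sum to zero, S + I + R stays equal to the initial population throughout the simulation, which is a convenient check on the integration.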

Fig. 6.14 Schematic representation of the SIR compartmental model’s mechanisms

Table 6.4 Parameter descriptions and values used to compute the Sobol’ indices for the SIR model

Unlike in the previous examples, we now also treat an initial condition as a random variable, to account for the uncertainty in the initial number of infected individuals, which is not well known. If we assume that the host population is constant, the initial number of susceptible individuals is also a random variable, determined by the initial number of infected. Finally, although time in our system is measured in days, we display the results in weeks to facilitate visualization.
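The dependence of the initial susceptible population on the random initial number of infected can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `sample_sir_inputs` and the nominal values are hypothetical, the dispersion factor is read as defining uniform bounds at nominal × (1 ± dispersion), which is only one possible interpretation, and R(0) = 0 is assumed.

```python
import random


def sample_sir_inputs(nominal, dispersion, population, rng):
    """Draw b, a and I0 uniformly on nominal*(1 -/+ dispersion); derive S0 from I0."""
    draw = {name: rng.uniform(value * (1.0 - dispersion),
                              value * (1.0 + dispersion))
            for name, value in nominal.items()}
    # Constant host population with R(0) = 0 (assumption): S0 is not sampled
    # independently, it is fixed by the sampled I0.
    draw["S0"] = population - draw["I0"]
    return draw


rng = random.Random(0)                          # fixed seed for reproducibility
nominal = {"b": 0.4, "a": 0.15, "I0": 10.0}     # hypothetical nominal values
samples = [sample_sir_inputs(nominal, 0.4, 1000.0, rng) for _ in range(5)]
```

With this construction only three independent random inputs (b, a, and \(I_0\)) enter the sensitivity analysis, even though four quantities vary from one sample to the next.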

Fig. 6.15 Sobol’ indices for the infected population of the SIR model in the weeks of interest simulated using 1000 samples

Figures 6.15 and 6.16 show the Sobol’ indices and the validation plots, respectively. The maximum degree of the PCE was 10, and the experimental design was composed of 1000 samples. A very good match between the computational model response and the surrogate can be observed. In contrast to the previous two examples, here we also compute the third-order Sobol’ index. In this scenario, this index is not very informative, but it suggests that the influence of the third-order term may increase over time. The same seems to occur with the importance of b, while the opposite is observed for \(I_0\). So, if we analyzed an earlier time window, the initial number of infected individuals might be more influential on the response. Similarly, in a later time window \(I_0\) might become unimportant. Of course, new studies are needed to support these hypotheses, but these possibilities are interesting to consider. Curiously, the removal rate had practically zero importance. This is not intuitive, because we expected it to have some influence over time. It is important to be clear, however, that this does not mean that the response is unaffected when the parameter a changes. Rather, the result shows that when the two parameters and the initial number of infected individuals vary simultaneously, b and \(I_0\) dominate how the response changes.
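As an independent cross-check of surrogate-based indices, first-order and total-order Sobol’ indices can also be estimated by direct Monte Carlo sampling. The sketch below uses pick-freeze estimators (a Saltelli-type formula for the first-order index and a Jansen-type formula for the total-order index) rather than the PCE approach used in this chapter. The function name `sobol_pick_freeze` and the test function are our own illustrative choices; the estimator is demonstrated on a simple additive function with known indices, not on the SIR model itself.

```python
import random


def sobol_pick_freeze(model, sample_input, n_inputs, n_samples, rng):
    """Monte Carlo estimates of first-order and total-order Sobol' indices.

    model: maps a list of inputs to a scalar output.
    sample_input: draws one list of independent inputs using rng.
    """
    A = [sample_input(rng) for _ in range(n_samples)]
    B = [sample_input(rng) for _ in range(n_samples)]
    fA = [model(x) for x in A]
    fB = [model(x) for x in B]
    both = fA + fB
    mean = sum(both) / len(both)
    var = sum((y - mean) ** 2 for y in both) / (len(both) - 1)
    first, total = [], []
    for i in range(n_inputs):
        # Rows of A with only the i-th input replaced by the one from B.
        fAB = [model(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        first.append(sum((yb - mean) * (yab - ya)
                         for ya, yb, yab in zip(fA, fB, fAB))
                     / (n_samples * var))
        total.append(sum((ya - yab) ** 2 for ya, yab in zip(fA, fAB))
                     / (2.0 * n_samples * var))
    return first, total


# Test function Y = X1 + 2*X2 with a third, inert input; X_i ~ U(0, 1).
# Exact indices: S = S_T = (0.2, 0.8, 0.0).
rng = random.Random(1)
first, total = sobol_pick_freeze(
    model=lambda x: x[0] + 2.0 * x[1],
    sample_input=lambda r: [r.random() for _ in range(3)],
    n_inputs=3, n_samples=4000, rng=rng)
```

To apply the same estimator to the SIR example, `model` would map a sampled (b, a, \(I_0\)) to the infected population at the week of interest. The cost is `n_samples * (n_inputs + 2)` model runs, which is the main reason a cheap surrogate such as a PCE is attractive.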

Fig. 6.16 Comparison between the PCE surrogate response and the original computational model response for the infected population in the weeks of interest (red curves), and the associated leave-one-out error (legends), using 1000 samples as the experimental design and 10000 as the validation set

6 Final Remarks

This chapter presented the use of Sobol’ indices for global sensitivity analysis of biological systems. This formalism allows us to measure statistically how each of the system’s parameters contributes to its response. It can be very useful for identifying which factors contribute most to a given quantity of interest associated with the biological system, which is valuable, for instance, when calibrating a computational model against a set of observations (data). The analysis tool was presented in the form of a tutorial, with a simplified summary of the theory that favors the fundamental ideas over mathematical formalism, and with applications to three biological systems of different levels of complexity. MATLAB code was developed and is available in a public repository to facilitate the use of this tool in a wide variety of biological systems.