
1 Introduction

The pharmaceutical industry is changing rapidly nowadays. One important change, compared with the situation 10 or 20 years ago, is undoubtedly the increased focus on development of more efficient production processes. The introduction of process analytical technology (PAT) by the Food and Drug Administration [1] forms an important milestone here, since its publication ended a long period of regulatory uncertainty. The PAT guidance indeed makes it clear that regulatory bodies are in favor of more efficient production methods, as long as a safe product can be guaranteed. This opens up new and exciting possibilities for innovation in pharmaceutical production processes.

One of the central concepts in PAT is the design space, which is defined as “the multi-dimensional combination of critical input variables and critical process parameters that lead to the right critical quality attributes” [1]. The term “critical” should be interpreted as “having a significant influence on final product quality.” Changing the process within the design space is therefore not considered as a change. As a consequence, no regulatory postapproval of the process is required for a change within the design space. Almost naturally, this opens up the possibility of increased use of optimization methods for pharmaceutical processes in the future, methods that have been used for a long time in, for example, the chemical industry [2].

Small-molecule (MW < 1,000) drug substances (APIs, NCEs) are typically produced via organic synthesis. In such a production system, the available process knowledge is often relatively large. Process systems engineering (PSE) methods and tools—especially those relying on mechanistic models to represent available process knowledge—are therefore increasingly applied in the frame of pharmaceutical process development and innovation of small-molecule drugs [3], with the aim of shortening time to market while yielding an efficient production process. In essence, mechanistic models rely on deterministic principles to represent available process knowledge on the basis of mass, energy, and momentum balances; given initial conditions, future system behavior can be predicted.

It is, however, not the intention here to provide a detailed review on mechanistic models for biobased production processes of pharmaceuticals. There are excellent textbooks and review articles on the general principles of mechanistic modeling of fermentation processes [4–7], biocatalysis [8, 9], and mammalian cell culture [10].

Biotechnology research has resulted in a new class of biomolecular drugs—typically larger molecules, also called biologics or NBEs—which includes monoclonal antibodies, cytokines, tissue growth factors, and therapeutic proteins. The production of biomolecular drugs is usually complicated and extremely expensive. The level of process understanding is therefore in many cases lower, compared with small-molecule drug substances, and as a consequence, PSE methods and tools relying on mechanistic models are usually not applied to the same extent in production of biomolecular drugs, despite the fact that quite a number of articles have been published throughout the years on the development of mechanistic models for such processes.

This chapter focuses on the potential use of mechanistic models within biobased production of drug products, as well as the use of good modeling practice (GMoP) when using such mechanistic models [11]. A case study with the yeast model by Sonnleitner and Käppeli [12] is used to illustrate how a mechanistic model can be formulated in a well-organized and easy-to-interpret matrix notation. This model is then analyzed using uncertainty and sensitivity analysis, an analysis that serves as a starting point for a discussion on the potential application of such methods. Strategies for mechanistic model-building are highlighted in the final discussion.

2 Case Study: Aerobic Cultivation of Budding Yeast

Saccharomyces cerevisiae is one of the most relevant and intensively studied microorganisms in biotechnology and bioprocess engineering. For example, out of 151 recombinant biopharmaceuticals that had been approved by the FDA and EMEA by January 2009, 28 (or 18.5 %) were produced in S. cerevisiae [13]. Sonnleitner and Käppeli [12] proposed a widely accepted mechanistic model describing the aerobic growth of budding yeast, and this model is used here to exemplify how a mechanistic model of a bioprocess can be applied to create more in-depth process knowledge. Optimally, the process knowledge should be translated into a mechanistic model, and the model should be updated whenever additional details of the process are unraveled. This model should capture the key phenomena taking place in the process, and be further employed in the development of process control strategies.

However, when developing and using mechanistic models, reliability of the model (hence the credibility of model-based applications) is an important issue, which needs to be assessed using appropriate methods and tools including identifiability, sensitivity, and uncertainty analysis techniques. Unfortunately, literature reporting on mechanistic model developments often lacks the results of such analysis (confidence intervals on estimated parameters, for example, are only sporadically reported), and as a consequence it is not possible to draw conclusions about the quality of the model and its predictions. Seen from a PAT perspective, it is of utmost importance to document that one has constructed a reliable mechanistic model. For example, if this model were used later for simulations to help determine where to put the borders of the design space, it would be difficult to defend the resulting design space (for example, towards the FDA) if the reliability of the model could not be documented sufficiently.

One of the challenges in modeling is the identifiability problem, defined as “given a set of data, how well can the unknown model parameters be estimated, hence identified.” Typically, the number of parameters in a mechanistic model is relatively high, and therefore it is often not possible to uniquely estimate all the parameters by fitting the model predictions to experimental measurements. An indication of the parameters that can be estimated based on available data can be obtained by performing an identifiability analysis prior to the parameter estimation.

Furthermore, the model predictions will depend on the values of all parameters. Some of the parameters will, however, have a stronger influence than others. An uncertainty and sensitivity analysis can be performed to determine which are the parameters whose variability contributes most to the variance of the different model outputs.

In this case study, a systematic model analysis is performed following the workflow presented in Fig. 1. This workflow is rather generic, and could easily be transferred to another case study with a similar model.

Fig. 1 Schematic workflow for the model analysis

2.1 Model Formulation

Under aerobic conditions, budding yeast may exclusively oxidize glucose (respiratory metabolism), or simultaneously oxidize and reduce glucose (fermentative metabolism) if the respiratory capacity of the cells is exceeded. The described overflow metabolism is commonly referred to as the Crabtree effect. Cells preferably oxidize glucose, as the energetic yield is more favorable for respiration than fermentation. In case the respiratory capacity is reached, the excess of glucose (i.e., overflow of glucose) is reduced using fermentative pathways that result in the production of ethanol. Moreover, in a second growth phase, yeast will then consume the produced ethanol, but only after depletion of glucose, as the latter inhibits the consumption of any other carbon source. Also acetate and glycerol are formed and consumed, although the corresponding concentrations are typically much lower than for ethanol.

The Sonnleitner and Käppeli [12] model describes the glucose-limited growth of Saccharomyces cerevisiae. This model is able to account for the overflow metabolism, and to predict the concentrations of biomass, glucose, ethanol, and oxygen throughout an aerobic cultivation in a stirred tank reactor. Acetate and glycerol are not included for simplification purposes. The model relies on three stoichiometric reactions describing the growth of biomass on glucose by respiration (Eq. 1) and by fermentation (Eq. 2), as well as the growth of biomass on ethanol by respiration (Eq. 3). The stoichiometry of the three different pathways can be summarized in a matrix form (Table 1) describing how the consumption of glucose, ethanol, and oxygen is related to the production of biomass and ethanol, i.e., the yields of the reactions. The mol-based stoichiometric coefficients can be converted into the corresponding mass-based yields, e.g., \( Y_{XG}^{Oxid} = b \times MW(\text{biomass})/MW(\text{glucose}) \).

Table 1 Stoichiometric matrix describing aerobic growth of budding yeast
$$ {\text{C}}_{ 6} {\text{H}}_{ 1 2} {\text{O}}_{6} \, + \,a{\text{O}}_{2}\, + \,b\;0.15\;\left[ {{\text{NH}}_{3} } \right] \to b{\text{C}}_{1} {\text{H}}_{ 1. 7 9} {\text{O}}_{ 0. 5 7} {\text{N}}_{ 0. 15} \,+\, c{\text{CO}}_{ 2}\,+\, d{\text{H}}_{ 2} {\text{O}} $$
(1)
$$ {\text{C}}_{ 6} {\text{H}}_{ 1 2} {\text{O}}_{ 6} \, + \,g\;0.15\;\left[ {{\text{NH}}_{ 3} } \right] \to g{\text{C}}_{ 1} {\text{H}}_{ 1. 7 9} {\text{O}}_{ 0. 5 7} {\text{N}}_{ 0. 1 5} \,+\, h{\text{CO}}_{2} \,+ \,i{\text{H}}_{ 2} {\text{O}} \,+ \,j{\text{C}}_{ 2} {\text{H}}_{ 6} {\text{O}} $$
(2)
$$ {\text{C}}_{2} {\text{H}}_{6} {\text{O}} + k{\text{O}}_{2} + l\;0.15\;\left[ {{\text{NH}}_{3} } \right] \to l{\text{C}}_{1} {\text{H}}_{1.79} {\text{O}}_{0.57} {\text{N}}_{0.15} + m{\text{CO}}_{2} + n{\text{H}}_{2} {\text{O}} $$
(3)

For each pathway, a mass balance can be established for each atomic element (e.g., C or N). To solve such elemental balances for carbon, hydrogen, and oxygen, one stoichiometric coefficient for each pathway has to be assumed. Since the biomass yield coefficients are often easily estimated from experimental data, they are typically the ones that are assumed. Therefore, only the coefficients b, g, and l, or the corresponding mass yields \( Y_{XG}^{Oxid} \), \( Y_{XG}^{Red} \), and \( Y_{XE} \), will be considered as model parameters; i.e., the other stoichiometric coefficients are fixed based on Eqs. 1–3.
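As an illustration of how the elemental balances fix the remaining coefficients, the sketch below solves the C, H, and O balances of Eq. 1 for a, c, and d once b is assumed, and converts b into a mass-based yield. The function names, the atomic masses, and the value of b used at the end are illustrative assumptions and are not taken from the original model.

```python
import numpy as np

# Standard approximate atomic masses (g/mol); not taken from the chapter.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007}


def molecular_weight(formula):
    """Molecular weight of a compound given as {element: count}."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


def oxidative_glucose_coefficients(b):
    """Solve the C, H and O balances of Eq. 1 for a (O2), c (CO2) and d (H2O).

    The biomass coefficient b is assumed to be known. The nitrogen balance
    is satisfied automatically because 0.15*b NH3 is supplied.
    """
    # Unknowns ordered as [a, c, d]; one row per element balance.
    A = np.array([
        [0.0, 1.0, 0.0],    # C: c = 6 - b
        [0.0, 0.0, 2.0],    # H: 2d = 12 + 3*0.15*b - 1.79*b
        [-2.0, 2.0, 1.0],   # O: -2a + 2c + d = 6 - 0.57*b
    ])
    rhs = np.array([6.0 - b, 12.0 + 0.45 * b - 1.79 * b, 6.0 - 0.57 * b])
    a, c, d = np.linalg.solve(A, rhs)
    return a, c, d


BIOMASS = {"C": 1, "H": 1.79, "O": 0.57, "N": 0.15}
GLUCOSE = {"C": 6, "H": 12, "O": 6}

b_assumed = 3.5  # hypothetical value, for illustration only
a, c, d = oxidative_glucose_coefficients(b_assumed)
# Mass-based yield of biomass on glucose (g biomass per g glucose).
Y_XG_oxid = b_assumed * molecular_weight(BIOMASS) / molecular_weight(GLUCOSE)
```

Setting b to zero recovers the complete combustion of glucose (a = c = d = 6), which is a convenient sanity check on the balances.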

Furthermore, a process matrix can be used to describe the rates of consumption and production of each of the model variables (glucose, ethanol, oxygen, and biomass), as well as the fluxes in each pathway. Details on the use of this matrix notation are provided by Sin and colleagues [14]. The interested reader can find additional details on elemental mass and energy balances applied to fermentation processes elsewhere [15, 16].

In the case of the model used as an example here, the total glucose consumption and ethanol consumption rates (when considered individually) are mathematically described using Monod-type kinetics (Eqs. 4–6). The maximum uptake rates for glucose, ethanol, and oxygen (\( r_{i,max} \)) are model parameters, and they are characteristic of the S. cerevisiae strain being used. The same goes for the substrate saturation constants: \( K_G \), \( K_E \), and \( K_O \). The maximum oxygen uptake rate (\( r_{O,max} \)) corresponds to the respiratory capacity, as it reflects the maximum rate for oxidation of glucose or ethanol when any of these carbon sources is in excess. The ethanol uptake rate includes a term accounting for glucose repression; i.e., ethanol consumption is only observed for low concentrations of glucose. The strength of inhibition (i.e., how low the glucose concentration should be before ethanol consumption is allowed) is defined by the inhibition constant \( K_i \). The specific growth rate of biomass is defined as the sum of the growth resulting from each pathway, and is estimated based on the yield of biomass on the substrate and the corresponding uptake rate (Eq. 7).

$$ {\text{r}}_{\text{G}}^{\text{Total}} = {\text{r}}_{\text{G,max}} \frac{\text{G}}{{{\text{G + }}K_{\text{G}} }} = {\text{r}}_{\text{G}}^{\text{Oxid}} + {\text{r}}_{\text{G}}^{\text{Red}} $$
(4)
$$ {\text{r}}_{\text{E}} = {\text{r}}_{{{\text{E}},\hbox{max} }} \frac{\text{E}}{{{\text{E}} + K_{\text{E}} }}\;\frac{{K_{i} }}{{{\text{G}} + K_{i} }} $$
(5)
$$ {\text{r}}_{\text{O}} = {\text{r}}_{{{\text{O}},\hbox{max} }} \frac{\text{O}}{{{\text{O}} + K_{\text{O}} }} $$
(6)
$$ \mu_{\text{Total}} = Y_{\text{XG}}^{\text{Oxid}} \times {\text{r}}_{\text{G}}^{\text{Oxid}} + Y_{\text{XG}}^{\text{Red}} \times {\text{r}}_{\text{G}}^{\text{Red}} + Y_{\text{XE}} \times {\text{r}}_{\text{E}}^{\text{Oxid}} $$
(7)

The rate of oxidation and the rate of reduction of glucose are defined based on the maximum oxygen uptake rate: if the oxygen demand that is stoichiometrically required for oxidation of the total glucose flux (\( Y_{OG} \times r_{G}^{Total} \)) exceeds the maximum oxygen uptake rate (\( r_{O,max} \)), the difference between the two fluxes corresponds to the overflow reductive flux. With regard to the oxidation of ethanol, the observed rate of ethanol oxidation depends on the ethanol availability (Eq. 5) and it is further limited by the respiratory capacity: not only the maximum capacity of the cell, but also the capacity remaining after considering metabolism of glucose (Table 2).
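The rate logic described in this section can be sketched as follows. This is one possible reading of the text, assuming that the respiratory capacity at a given dissolved oxygen concentration is given by Eq. 6 and that the bottleneck is expressed with `min` operators; the exact rate expressions of Table 2 may differ. The parameter name `Y_OE` (oxygen demand per unit of ethanol oxidized) and all numerical values are hypothetical.

```python
def specific_rates(G, E, O, p):
    """Specific uptake and growth rates for given concentrations.

    G, E, O: glucose, ethanol and dissolved oxygen concentrations.
    p: dict of parameters (names are assumptions made for this sketch).
    """
    r_G_total = p["r_G_max"] * G / (G + p["K_G"])                               # Eq. 4
    r_E_avail = p["r_E_max"] * E / (E + p["K_E"]) * p["K_i"] / (G + p["K_i"])   # Eq. 5
    r_O_cap = p["r_O_max"] * O / (O + p["K_O"])                                 # Eq. 6

    # Glucose is oxidized as far as the respiratory capacity allows;
    # the excess glucose flux is the reductive (overflow) flux.
    r_G_oxid = min(r_G_total, r_O_cap / p["Y_OG"])
    r_G_red = r_G_total - r_G_oxid

    # Ethanol oxidation is limited by ethanol availability and by the
    # respiratory capacity left after glucose oxidation.
    r_O_left = max(0.0, r_O_cap - p["Y_OG"] * r_G_oxid)
    r_E_oxid = min(r_E_avail, r_O_left / p["Y_OE"])

    mu = (p["Y_XG_oxid"] * r_G_oxid + p["Y_XG_red"] * r_G_red
          + p["Y_XE"] * r_E_oxid)                                               # Eq. 7
    return {"r_G_oxid": r_G_oxid, "r_G_red": r_G_red,
            "r_E_oxid": r_E_oxid, "r_G_total": r_G_total, "mu": mu}


# Hypothetical parameter values, for illustration only (not those of Table 4).
example_params = {
    "r_G_max": 3.0, "r_E_max": 0.2, "r_O_max": 0.25,
    "K_G": 0.2, "K_E": 0.1, "K_O": 1e-4, "K_i": 0.1,
    "Y_OG": 0.4, "Y_OE": 1.0,
    "Y_XG_oxid": 0.5, "Y_XG_red": 0.05, "Y_XE": 0.6,
}

overflow = specific_rates(G=10.0, E=0.0, O=0.007, p=example_params)
diauxic = specific_rates(G=0.0, E=5.0, O=0.007, p=example_params)
```

With excess glucose the sketch returns a positive reductive flux (overflow metabolism), whereas with glucose depleted only ethanol oxidation contributes to growth.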

Table 2 Process matrix describing the conversion rates and stoichiometry for each model variable: glucose, ethanol, oxygen, and biomass

In addition to the reactions taking place in the cells, oxygen is continuously supplied to the bioreactor. This supply is described based on the mass transfer coefficient (\( k_{L}a \)) and the difference between the dissolved oxygen concentration (O) and the saturation concentration of oxygen in water (O*) as a driving force. \( k_{L}a \) is dependent on the aeration intensity and the mixing conditions in a given fermentor. It is also dependent on the biomass concentration, although this dependence is often disregarded. The rates for each component can be obtained from the process model matrix (Table 2) by multiplying the transpose of the stoichiometric matrix (Z′) by the process rate vector (ρ): \( \mathbf{r}_{m \times 1} = \mathbf{Z}'_{n \times m} \times \boldsymbol{\rho}_{n \times 1} \), where m corresponds to the number of components (or model variables), n is the number of processes, and Z has dimensions n × m. In Table 3, a nomenclature list of vectors and matrices is presented.
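The matrix product can be illustrated with a deliberately reduced example. The two-process, three-component matrix below (oxidative growth on glucose plus oxygen transfer) is a hypothetical simplification and does not reproduce the entries of Table 2; all numbers are made up for illustration.

```python
import numpy as np

# Components (columns): glucose G, oxygen O, biomass X.
# Processes (rows): oxidative growth on glucose, oxygen transfer.
Y_OG, Y_XG = 0.4, 0.5   # hypothetical yields
Z = np.array([
    [-1.0, -Y_OG, Y_XG],   # growth consumes G and O and produces X
    [0.0, 1.0, 0.0],       # gas-liquid transfer only adds oxygen
])

# Process rate vector rho (n x 1): [glucose uptake rate, kLa * (O* - O)].
rho = np.array([2.0, 0.3])  # hypothetical rates

# Net conversion rate of each component: r (m x 1) = Z' (m x n) times rho (n x 1).
r = Z.T @ rho
```

Reading the result column by column of Z shows how each component's net rate collects contributions from every process in which it takes part.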

Table 3 Nomenclature list of matrices and vectors used in the model formulation and model analysis

The model matrix in Table 2 provides a compact overview of the model equations. In the example here, it contains information about the biological reactions and the transfer of oxygen from the gas to the liquid phase. Of course, depending on the purpose of the model, the model matrix could be extended with additional equations, for instance, aiming at a more detailed description of the biological reactions, e.g., by including additional state variables, or aiming at the description of the mass transfer of additional components, e.g., CO2 stripping from the fermentation broth. Sin and colleagues [14] provided an example of the extension of the model matrix with chemical processes for the kinetic description of mixed weak acid–base systems. The latter is important in case pH prediction is part of the purpose of the model. In the work of Sin and colleagues [14], the yield coefficients are all part of the stoichiometric matrix. In our case here, an alternative rate vector is presented, where all rates are normalized with regard to glucose.

2.2 Parameter Identifiability Analysis

The model described in the previous sections has four variables—glucose (G), ethanol (E), oxygen (O), and biomass (X)—and 11 parameters. In addition, the oxygen saturation concentration in water (at growth temperature) is necessary for solving the model. A list of the parameters and their descriptions is provided in Table 4.

Table 4 Model parameters, corresponding units, and numerical values [12]

The maximum specific growth rate on ethanol \( (\mu_{E,max}) \) is defined as the product of the yield of biomass on ethanol \( (Y_{XE})\) and the maximum specific ethanol uptake rate \((r_{E,max}^{Oxid})\). For consistency between parameters, the ethanol specific uptake rate is used as a parameter in this example.

The number of parameters is considerably larger than the number of model variables (or outputs), which is typical for this type of model. It is therefore questionable whether all parameters can be estimated based on experimental data, even if the four model variables were to be measured simultaneously. This is the subject of identifiability analysis, which seeks to identify which of the parameters can be estimated with high degree of confidence based on the available experimental measurements.

The main purpose of such an identifiability analysis is in fact to increase the reliability of parameter estimation efforts from a given set of data [17]. One method available to perform such an analysis is the two-step procedure based on sensitivity and collinearity index analysis proposed by Brun and colleagues [18]. Accordingly, the method calculates two identifiability measures: (1) the parameter importance index (δ) that reflects the sensitivity of the model outputs to single parameters, and (2) the collinearity index (γ) which reflects the degree of near-linear dependence of the sensitivity functions of parameter subsets. A parameter subset (a combination of model parameters) is said to be identifiable if (1) the data are sufficiently sensitive to the parameter subset (above a cutoff value), and (2) the collinearity index is sufficiently low (below a cutoff value).

2.2.1 Local Sensitivity Analysis: Parameter Importance Indices δ

The local importance of an individual parameter to a model output for small changes (Δθ) in the parameter values (θ) at a specific location (\( \theta_0 \)) can be measured by the estimation of a dimension-free scaled sensitivity matrix \( S_{sc} = \{ s_{ij} \} \), where the index i refers to a specific model variable (output) and j denotes the model parameter. For further details, the reader is referred to the original paper of Brun and colleagues [18]. The mean squared norm of column \( s_j \), denoted by \( \delta_j \), is a measure of the importance of parameter \( \theta_j \) (see Eqs. 8–10). A large norm indicates that the parameter is identifiable with the available data if all other parameters are fixed. A parameter importance ranking can be obtained by ranking the parameters according to their δ indices. The lower the value of δ, the lower the importance of that parameter.

For this first analysis, the parameter values (Table 4) provided in the original paper [12] are used as nominal values at which sensitivity functions are calculated. The scaled sensitivity matrix S and the resulting ranking of δ importance indices were calculated using Eqs. 8–10, and are graphically compared in Fig. 2. It is noteworthy that the δ indices are very sensitive to: (1) the choice of variation range defined for each parameter (Δθ), (2) the scaling factors (sc) used to calculate the sensitivity matrix, and (3) the original set of parameters (\( \theta_0 \)), naturally as this is a local analysis. In this example the sc were defined as the mean of the experimental observations for each variable.

Fig. 2 Parameter importance indices (δ) for the four model variables: glucose, ethanol, dissolved oxygen and biomass

$$ v_{ij} = \left. {\frac{{\partial \eta_{i} (\theta_{j} )}}{{\partial \theta_{j} }}} \right|_{{\theta = \theta_{0} }} $$
(8)
$$ s_{ij} = v_{ij} \frac{{\Delta \theta_{j} }}{{sc_{i} }} $$
(9)
$$ \delta_{j} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {s_{ij}^{2} } } $$
(10)
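A minimal numerical sketch of Eqs. 8–10 is given below. It assumes that the derivatives are approximated by central finite differences and that the model outputs are stacked in a single vector, with one scaling factor per entry; the function name, the step size, and the toy exponential model are illustrative assumptions only.

```python
import numpy as np


def importance_indices(model, theta0, delta_theta, sc, rel_step=1e-6):
    """Scaled sensitivity matrix S and importance indices delta.

    model: callable returning a 1-D array of n stacked model outputs.
    theta0: nominal parameter values (length J).
    delta_theta: assumed variation range of each parameter (length J).
    sc: scaling factor for each of the n output entries.
    """
    theta0 = np.asarray(theta0, dtype=float)
    sc = np.asarray(sc, dtype=float)
    n_outputs = np.asarray(model(theta0)).size
    S = np.empty((n_outputs, theta0.size))
    for j in range(theta0.size):
        h = rel_step * max(abs(theta0[j]), 1.0)
        up, down = theta0.copy(), theta0.copy()
        up[j] += h
        down[j] -= h
        # Central-difference approximation of Eq. 8.
        v_j = (np.asarray(model(up)) - np.asarray(model(down))) / (2.0 * h)
        S[:, j] = v_j * delta_theta[j] / sc          # Eq. 9
    delta = np.sqrt(np.mean(S ** 2, axis=0))         # Eq. 10
    return S, delta


# Toy model (not the yeast model): exponential decay sampled at six times.
t = np.linspace(0.0, 5.0, 6)
toy_model = lambda theta: theta[0] * np.exp(-theta[1] * t)
theta_nominal = np.array([2.0, 0.5])
S, delta = importance_indices(toy_model, theta_nominal,
                              delta_theta=0.1 * theta_nominal,
                              sc=np.ones(t.size))
ranking = np.argsort(delta)[::-1]   # parameter indices, most important first
```

Because the result depends on `delta_theta`, `sc`, and the nominal values, repeating the calculation with other choices is a simple way to see how sensitive the ranking is to these settings.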

The results of the parameter significance ranking indicate that the yield coefficient \( Y_{XG}^{Oxid} \) is the parameter that most affects all four model outputs. Variations in the maximum uptake rates will also have a significant effect on the model outputs. As may be expected, the glucose maximum uptake rate is most significant with regard to the model prediction for glucose, whereas the maximum uptake rate of ethanol is most important for the prediction of ethanol and dissolved oxygen. The prediction of biomass is also greatly affected by the yield of biomass on ethanol, in addition to the yield on glucose (oxidative metabolism). The impact of the saturation constants is rather limited for any of the model variables.

2.2.2 Identifiability of Parameter Subsets: Collinearity Index γK

In addition to understanding the importance of individual parameters to the model output, it is necessary to take the joint influence of all parameters (\( [\theta_1, \ldots, \theta_J] \)) into account as well. If columns \( s_j \) are nearly linearly dependent, the change of a parameter \( \theta_j \) can be compensated by a change in the other parameter values. This means that the parameters \( [\theta_1, \ldots, \theta_J] \) are not uniquely identifiable.

The collinearity index \( \gamma_K \) assesses the degree of near-linear dependence between a subset of K (2 ≤ K ≤ J) parameters, i.e., columns of the scaled sensitivity matrix.

A high value of a collinearity index indicates that the parameter set is poorly identifiable. In practice, \( \gamma_K \) is calculated for all subsets of K parameters out of the 11 parameters and is plotted in Fig. 3, together with the size of each subset. In this case, a subset was considered identifiable if the corresponding collinearity index was smaller than 5. This threshold has to be defined a priori. Brun and colleagues [18] suggested as a rule of thumb that this threshold should lie in the range 5–20, where the lowest threshold corresponds to the strictest criterion. In practice, the choice of threshold depends on the prior experience of the model user, which makes the analysis an iterative process.
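The subset screening can be sketched as follows, using the definition of the collinearity index given by Brun and colleagues [18]: the columns of the scaled sensitivity matrix are normalized to unit length, and the index is the inverse square root of the smallest eigenvalue of the resulting Gram matrix. The function names and the small demonstration matrix are illustrative assumptions.

```python
import itertools

import numpy as np


def collinearity_index(S, subset):
    """Collinearity index of the parameter subset (column indices of S)."""
    cols = S[:, list(subset)]
    normed = cols / np.linalg.norm(cols, axis=0)      # unit-length columns
    smallest_eig = np.linalg.eigvalsh(normed.T @ normed)[0]
    if smallest_eig <= 0.0:
        return np.inf                                 # exactly dependent columns
    return 1.0 / np.sqrt(smallest_eig)


def identifiable_subsets(S, size, threshold=5.0):
    """All subsets of the given size whose index is below the threshold."""
    found = []
    for subset in itertools.combinations(range(S.shape[1]), size):
        gamma = collinearity_index(S, subset)
        if gamma < threshold:
            found.append((subset, gamma))
    return sorted(found, key=lambda item: item[1])    # best subsets first


# Demonstration matrix (made up): columns 0 and 1 are nearly collinear,
# column 2 is orthogonal to column 0.
S_demo = np.array([[1.0, 1.0, 0.0],
                   [0.0, 1e-3, 1.0]])
pairs = identifiable_subsets(S_demo, size=2)
```

Orthogonal sensitivity columns give an index of 1, while nearly parallel columns give a very large index, so the pair (0, 1) is rejected in the demonstration. For 11 parameters the number of subsets is small enough to enumerate exhaustively, as done here.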

Fig. 3 Collinearity index and size corresponding to parameter subsets of increasing size. The top plot refers to all the parameter subsets evaluated in the analysis, whereas the bottom plot refers exclusively to the subsets that complied with the a priori defined collinearity threshold

All the model variables were considered in this analysis, implying as well that all could be measured experimentally. As illustrated in Fig. 3, a maximum of eight parameters can be identified, and the collinearity index increases with the number of parameters. The maximum collinearity index observed for combinations of eight parameters was 22.34, while the best identifiable sets of eight parameters correspond to a \( \gamma_K \) value of approximately 2.65. These parameter subsets are listed in Table 5.

Table 5 Identifiable parameter subsets with maximum number of parameters and corresponding collinearity index

It is indeed known that a change in the maximum uptake rate of glucose can be compensated with a change of biomass yield coefficients. Also, based on the model structure, it is clear that changes in yields for the oxidative and reductive consumptions of glucose can compensate each other. It is therefore not surprising that the parameter subsets that have higher collinearity index include these parameters. When comparing the subset of six parameters with the lowest collinearity index (last row in Table 5) with the “best” subset of eight parameters (shaded row in Table 5), the two parameters that have been removed in the subset of six parameters are the maximum uptake rates of ethanol and oxygen.

The collinearity between the uptake rates and the yield coefficients explains why, even though they are the parameters with greatest importance for the model outputs (Fig. 2), they are not all included in the identifiable parameter subsets.

2.3 Parameter Estimation

Two datasets corresponding to two replicate batch fermentations of S. cerevisiae were available. For further details on the experimental data collection methods the reader is referred to the work of Carlquist et al. [19]. The dynamic profiles of glucose, ethanol, and biomass (as optical density, OD) were available for the two datasets, while oxygen data were only available for one of them. The OD measurements were converted into biomass dry weight (DW) values using a previously determined linear correlation (DW = 0.1815 × OD).

The parameters in the “best” identifiable subset were estimated by minimization of the weighted least-square errors. The weights for each variable i were defined by \( w_{i} = 1/(sc_{i})^{2} \), and the scaling factors (also used in Eq. 9) were defined as the mean of the experimental observations for each given variable. The estimation was done simultaneously for the two datasets. The new estimates of the identifiable parameters are presented in Table 6. In Fig. 4, the model predictions obtained with the estimated set of parameters are compared with the experimental data.
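The weighted least-squares criterion can be written compactly as below. This is a sketch of the objective function only, assuming that the scaling factor of each variable is the mean of its observations within the dataset passed in; the function name and the numbers in the example are made up for illustration.

```python
import numpy as np


def weighted_sse(predicted, observed):
    """Weighted sum of squared errors over all measured variables.

    predicted, observed: dicts mapping a variable name to a 1-D array of
    values at the sampling times. Variables without measurements are
    simply left out of `observed`.
    """
    total = 0.0
    for name, y_obs in observed.items():
        y_obs = np.asarray(y_obs, dtype=float)
        y_hat = np.asarray(predicted[name], dtype=float)
        sc = np.mean(y_obs)          # scaling factor: mean of the observations
        weight = 1.0 / sc ** 2       # w_i = 1 / sc_i^2
        total += weight * np.sum((y_obs - y_hat) ** 2)
    return total


# Made-up numbers, for illustration only.
observed = {"glucose": [2.0, 2.0], "biomass": [1.0, 3.0]}
predicted = {"glucose": [1.0, 3.0], "biomass": [1.0, 1.0]}
objective = weighted_sse(predicted, observed)
```

For a simultaneous fit of several datasets, the objective values of the individual datasets can be summed and passed to any general-purpose minimizer; a dataset lacking oxygen data then contributes no oxygen term.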

Table 6 Estimated values for the identifiable subset of parameters
Fig. 4 Comparison of model predictions versus experimental data collected for cultivation 1 (black line: model prediction; black circles: experimental data) and cultivation 2 (blue dashed line: model prediction; blue stars: experimental data)

Generally, the model predictions are in good agreement with the experimental data. An overprediction of the biomass concentration and a slight underestimation of the ethanol concentration are however observed. The oxygen profile describes the drop of the dissolved oxygen concentration during the growth, and a steep increase upon the depletion of ethanol and the resulting growth arrest. The dynamics of oxygen described by the model assumes a constant mass transfer coefficient (\( k_{L}a \)) and equilibrium between the gas and liquid phases. It is worth mentioning that the formation of other metabolites (i.e., glycerol and acetate) that are not considered in the model may explain the discrepancies to some degree. In fact, the overestimation of biomass which can be observed in Fig. 4 may be caused by the fact that other carbon-containing metabolites have not been taken into account.

When assessing the goodness of fit of the mechanistic model, it is important to consider that the experimental measurements have an associated error as well. Model predictions may not give a “perfect” fit at first sight, but they may well be within the experimental error. While such error might be relatively low for the measurement of glucose and ethanol by high-performance liquid chromatography (HPLC), it is significantly higher for dry weight measurements, which are less reliable, especially for low biomass concentrations (too large sample volumes would be required for increasing accuracy). Additionally, at the end of the fermentation, the biomass dry weight may include a fraction of nonviable and/or dormant cells.

2.3.1 Confidence Intervals for Estimated Parameters

The estimated parameter values as such only have limited value if they are not presented in combination with a measure of the degree of confidence that one can have in them. Therefore, the confidence intervals for each of the parameters are defined based on the covariance matrix and the Student t-probability distribution. The covariance matrix is calculated using the residuals between model predictions and experimental measurements, together with the standard deviations of the experimental measurements (further details are provided by Sin et al. [14]). An experimental error of 5 % was assumed for glucose and ethanol measurement by HPLC, as well as for the oxygen measurements using a gas analyzer for determining the composition of the exhaust gas, and a 20 % error for the determination of the cell dry weight. The confidence intervals at (1 − α) confidence level were calculated using Eq. 11, where COV is the covariance matrix of the parameter estimators, t(N − M, α/2) is the t-distribution value corresponding to the α/2 percentile, N is the total number of experimental observations (45 samples for the two cultivations), and M is the total number of parameters. The confidence intervals for the estimated parameters are presented in Table 7.

Table 7 Confidence intervals for the identifiable subset of parameters for 95 % confidence level
$$ \theta_{1 - \alpha } = \theta \pm \sqrt {{\text{diag}}({\text{COV}}(\theta ))} \cdot t\left( {N - M,\frac{\alpha }{2}} \right). $$
(11)

None of the confidence intervals include zero, giving a first indication that all parameters are significant to a certain degree and the model does not seem to be overparameterized. In the case of the inhibition constant \( K_i \), the confidence interval is rather large. This is most likely a consequence of the low sensitivity of model outputs to this parameter (Fig. 2). Furthermore, the confidence intervals of the Monod half-saturation constants \( K_G \) and \( K_E \) are quite large as well, which might be related to the fact that their estimated values are rather low. The latter means that the collected data contain only a few data points that can be used during the parameter estimation for extracting information on the exact values of \( K_G \) and \( K_E \). Indeed, only the data corresponding to relatively low glucose and ethanol concentrations can be used, since the specific rates will be relatively constant and close to maximum for higher substrate concentrations.

It is furthermore also a good idea to analyze the values of the parameter confidence intervals simultaneously with the correlation matrix (Table 8). For example, the correlation matrix shows that \( r_{E,max} \) is correlated with \( K_E \) and that \( r_{O,max} \) is correlated with \( K_O \). Both correlations are inherent to the model structure; i.e., correlation between the parameters related to the maximum specific growth rate and the substrate affinity constant in Monod-like kinetics expressions is quite common, and points towards a structural identifiability issue.
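Both quantities follow directly from the covariance matrix of the parameter estimators, as sketched below. The t-distribution value is passed in as an argument (for example, taken from a statistical table), and the covariance matrix, estimates, and function names are hypothetical and chosen for illustration only.

```python
import numpy as np


def confidence_intervals(theta, cov, t_value):
    """Lower and upper confidence limits according to Eq. 11."""
    theta = np.asarray(theta, dtype=float)
    half_width = np.sqrt(np.diag(cov)) * t_value
    return theta - half_width, theta + half_width


def correlation_matrix(cov):
    """Correlation matrix obtained by scaling the covariance matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)


# Hypothetical two-parameter example (not the values of Tables 7 and 8).
theta_hat = np.array([1.0, 0.5])
cov_hat = np.array([[0.04, 0.018],
                    [0.018, 0.01]])
# Two-sided 95 % value for 30 degrees of freedom from a t-table;
# in the case study the degrees of freedom would be N - M.
t_table = 2.042

lower, upper = confidence_intervals(theta_hat, cov_hat, t_table)
corr = correlation_matrix(cov_hat)
```

An interval that contains zero, or an off-diagonal correlation close to ±1, flags a parameter (or parameter pair) that the data do not determine well.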

Table 8 Correlation matrix for all model parameters

Note also that the significant correlations found between some of the model parameters (Table 8) seem to conflict with the results of the collinearity index analysis reported earlier (Fig. 3; Table 5). This is one of the reasons why identifiability analysis is an iterative process.

2.4 Uncertainty Analysis

Uncertainty analysis allows for understanding the variance of the model outputs as a consequence of the variability in the input parameters. Such an analysis can be performed using the Monte Carlo procedure, which consists of three steps: (1) definition of the parameter space, (2) generation of samples of the parameter space, i.e., combinations of parameters, and (3) simulation of the model using the set of samples generated in the previous step. In this case study, a sample set of 1,000 combinations of parameter values was generated using the Latin hypercube sampling procedure [20]. This sampling technique can be set up such that it takes the correlations between parameters, i.e., information resulting from the parameter estimation, into account (as explained by Sin et al. [11]). The correlation matrix for all the parameters was estimated and is presented in Table 8. For each parameter, minimum and maximum values have to be defined: for the estimated parameters the limits of the 95 % confidence intervals were used, while a variability of 30 % around the default values was assumed for the remaining parameters.
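The three Monte Carlo steps can be sketched as follows. This simplified version draws an uncorrelated Latin hypercube sample from uniform ranges; it does not impose the parameter correlations used in the case study, and the toy model, parameter ranges, and function names are assumptions made for illustration.

```python
import numpy as np


def latin_hypercube(n_samples, lower, upper, rng):
    """Uncorrelated Latin hypercube sample within [lower, upper]."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_par = lower.size
    # One random point inside each of n_samples equal-probability strata.
    strata = np.arange(n_samples)[:, None]
    u = (strata + rng.random((n_samples, n_par))) / n_samples
    # Shuffle the strata independently for every parameter.
    for j in range(n_par):
        u[:, j] = rng.permutation(u[:, j])
    return lower + u * (upper - lower)


# Step 1: parameter space (hypothetical minimum and maximum values).
lower_bounds, upper_bounds = [0.8, 0.1], [1.2, 0.3]

# Step 2: sampling of the parameter space.
rng = np.random.default_rng(seed=1)
samples = latin_hypercube(1000, lower_bounds, upper_bounds, rng)

# Step 3: one model simulation per sample (toy model, not the yeast model).
time = np.linspace(0.0, 10.0, 21)
toy_model = lambda theta: theta[0] * np.exp(-theta[1] * time)
outputs = np.array([toy_model(theta) for theta in samples])  # (1000, 21)
```

Each parameter range is divided into as many strata as there are samples, and every stratum is sampled exactly once, which is why the method covers the parameter space more evenly than plain random sampling for the same number of simulations.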

The correlation between two parameters can take values between −1 and 1. A positive correlation indicates that an increase in the value of one parameter will be accompanied by an increase in the value of the other parameter as well. On the contrary, a negative value indicates an inverse relationship. In Fig. 5, the sampling space is illustrated by scatter plots of combinations of two parameters. A high correlation (in absolute value) will lead to an elliptical or linear cloud of sampling points, as, for example, for \( Y_{XG}^{Oxid} \) and \( Y_{XG}^{Red} \) [corr(\( Y_{XG}^{Oxid} \), \( Y_{XG}^{Red} \)) = −0.98 in Table 8], as well as \( r_{E,max} \) and \( K_E \), and \( r_{O,max} \) and \( K_O \).

Fig. 5 Latin hypercube sampling for the model parameters, taking into account the correlation between them

The number of samples and the assumed range of variability of each parameter (i.e., the parameter space) are defined by the expert performing the analysis. The higher the number of samples, the more effectively the parameter space will be covered, at the expense of increased computational time. The range of the parameter space should rely on previous knowledge of the process: (1) the initial guess of the numerical parameter values can be obtained from the literature or from a first rough estimation in which all parameters are included; (2) the variability (range) of each parameter can be determined from the confidence intervals, if a parameter estimation has been done, or be defined based on expert knowledge as discussed by Sin et al. [11].

The simulation results for the four model variables (outputs), together with the corresponding mean and a prediction band defined by the 10th and 90th percentiles, are presented in Fig. 6. The narrow prediction band (containing 80 % of the model predictions) for glucose reflects the robustness of the predictions for this model variable, while the wide bands observed, for example, for oxygen show the need for more accurate parameter estimates in order to obtain a good model prediction.
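A prediction band of this kind can be computed from the Monte Carlo outputs as sketched below. This is an illustration using the Python standard library and made-up numbers, not the data of the case study. The percentile is obtained by linear interpolation between sorted values, which is one of several common conventions and is an assumption here.

```python
import statistics


def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def prediction_band(simulations, lower=10, upper=90):
    """simulations[i][t] is the output of Monte Carlo run i at time index t."""
    band = []
    for at_time in zip(*simulations):  # all runs at one time point
        band.append({
            "mean": statistics.fmean(at_time),
            "low": percentile(at_time, lower),
            "high": percentile(at_time, upper),
        })
    return band


# Made-up outputs of 11 runs at 2 time points (not data from the case study).
runs = [[float(i), 10.0] for i in range(11)]
band = prediction_band(runs)
```

At the first time point the runs are spread out and the band is wide; at the second time point all runs coincide and the band collapses onto the mean, which corresponds to the situation described for glucose.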

Fig. 6 Representation of uncertainty in the model predictions for glucose, ethanol, dissolved oxygen, and biomass: Monte Carlo simulations (blue), mean, and the 10th and 90th percentile of the predictions (black)

2.5 Sensitivity Analysis: Linear Regression of Monte Carlo Simulations

Based on the Monte Carlo simulations, a global sensitivity analysis can be conducted. The aim of the sensitivity analysis is to apportion the output uncertainty to the uncertainty in the individual inputs (parameters). The linear regression method is a rather simple yet powerful analysis that assumes a linear relation between the parameter values and the model outputs. The sensitivity of the model outputs to the individual parameters, for a given time point, is summarized by ranking the parameters according to the absolute value of the standardized regression coefficient (SRC). In dimensionless form, the linear regression is described by Eq. 12, where \( sy_{ik} \) is the scalar value of the kth model output, \( y_{k} \), in the ith Monte Carlo sample, \( \theta_{ij} \) is the value of the jth input parameter, \( \theta_{j} \), in that sample, μ and σ denote the mean and standard deviation over all samples, M is the number of parameters, and \( \varepsilon_{ik} \) is the regression error. \( \beta_{jk} \) is the SRC of the jth input parameter for the kth model output; its magnitude indicates how strongly the input parameter contributes to the output.

$$ \frac{sy_{ik} - \mu_{sy_{k}}}{\sigma_{sy_{k}}} = \sum\limits_{j = 1}^{M} \beta_{jk} \times \frac{\theta_{ij} - \mu_{\theta_{j}}}{\sigma_{\theta_{j}}} + \varepsilon_{ik} $$
(12)

In the case of nonlinear dependence of the model variable on a parameter, this method can still be used, although with caution. As a rule of thumb, if the coefficient of determination (\( R^{2} \)) of the regression is lower than 0.7, the analysis is not conclusive. The SRC for each parameter has, by definition, a value between −1 and 1, where a negative sign indicates that the output value will decrease when the value of the parameter increases. Conversely, a positive SRC indicates that the output value will increase with the parameter value. Sin et al. [11] describe further details on how to perform the analysis.
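The regression of Eq. 12 can be sketched for a single output and time point as follows. This is a minimal illustration under stated assumptions, not the implementation used in the case study: it relies only on the Python standard library, solves the normal equations with a small Gaussian elimination routine, and uses a made-up linear test function with three hypothetical parameters.

```python
import random
import statistics


def standardize(column):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = statistics.fmean(column), statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def src(theta, sy):
    """theta[i][j]: parameter j in sample i; sy[i]: scalar output of sample i.

    Returns (beta, r2) for the standardized regression of Eq. 12.
    """
    n, m = len(theta), len(theta[0])
    z = [standardize([row[j] for row in theta]) for j in range(m)]  # z[j][i]
    y = standardize(sy)
    # Normal equations (Z^T Z) beta = Z^T y on the standardized data.
    ztz = [[sum(a * b for a, b in zip(z[j], z[l])) for l in range(m)]
           for j in range(m)]
    zty = [sum(a * b for a, b in zip(z[j], y)) for j in range(m)]
    beta = solve(ztz, zty)
    fitted = [sum(beta[j] * z[j][i] for j in range(m)) for i in range(n)]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum(yi ** 2 for yi in y)  # the mean of y is zero after standardizing
    return beta, 1.0 - ss_res / ss_tot


# Made-up example: the output depends linearly on two of three parameters.
rng = random.Random(1)
theta = [[rng.random() for _ in range(3)] for _ in range(200)]
sy = [3.0 * t[0] - 1.0 * t[1] for t in theta]
beta, r2 = src(theta, sy)
ranking = sorted(range(3), key=lambda j: -abs(beta[j]))  # most influential first
```

Because the test function is exactly linear, the regression explains all of the output variance and the third parameter receives an SRC of essentially zero. For a nonlinear model, the returned `r2` should be checked against the 0.7 threshold before the ranking is interpreted.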

In the model example, different growth phases are described, and the importance of the parameters is therefore expected to change with time. For this reason, the analysis was performed for a selection of time points up to 62 h.

The suitability of the linear regression method was also assessed for each time point and each output. The \( R^{2} \) values are presented in Fig. 7 as a function of time.

Fig. 7 Coefficient of determination (\( R^{2} \)) for each model output, indicating the goodness of fit of the linear regression used for estimating the sensitivity of each model output to the various parameters. For \( R^{2} \) values lower than 0.7, the corresponding standardized regression coefficient (SRC) may yield erroneous information

While the regression method seems to be suitable for all time points in the case of biomass, the same is not observed for glucose, ethanol, and oxygen. With regard to glucose, the model uncertainty is very small (narrow spread of the model predictions plotted in Fig. 6). The depletion of glucose is estimated to occur at approximately 22 h in all cases. A sensitivity analysis performed when the glucose concentrations are virtually zero is not expected to be meaningful, and it is thus not surprising that the \( R^{2} \) value decreases abruptly at approximately the time point at which glucose is depleted. Simultaneously, the uncertainty in the ethanol concentration predictions increases substantially. This may explain the temporary drop in the \( R^{2} \) value for ethanol at this time point. A similar drop in \( R^{2} \) is observed for oxygen around the time that ethanol is depleted and a sudden rise in the dissolved oxygen concentration is observed. Upon ethanol depletion, the \( R^{2} \) value for ethanol falls below the threshold, similarly to what was observed for glucose at its depletion.

In Fig. 8, an overview of the SRCs for each parameter and model output is presented. Interpretation of parameter ranking and SRC should be made cautiously. All model outputs seem to be sensitive to the yield coefficient of biomass on oxidized glucose, even during the growth phase on ethanol (after glucose depletion).

Fig. 8 Standardized regression coefficients (SRC) for the four model outputs as a function of time. Only the time points for which \( R^{2} > 0.7 \) was observed are presented. Each color corresponds to a model parameter

The ranking of each parameter according to the SRC for each model output is illustrated in Fig. 9. This ranking shows the decrease in sensitivity of the glucose prediction towards the maximum glucose uptake rate, as well as the simultaneous increase in sensitivity towards the maximum oxygen uptake rate, during the growth phase on glucose. This is in agreement with the fact that the consumption of glucose is initially limited only by the maximum uptake rate (excess of glucose in the medium), whereas later, as the biomass concentration increases and the glucose concentration decreases, the observed uptake rate is no longer maximal. Similar figures for the parameter ranking regarding ethanol, oxygen, and biomass can be drawn.

Fig. 9 Ranking of each model parameter according to the magnitude of the SRC for each model output: a rank of 1 indicates that the model output is most sensitive to that parameter, while a rank of 11 indicates that the parameter contributes the least to the variance of the model output

Fig. 10 Model simulation results using Morris sampling of parameter space: model simulations for glucose, ethanol, dissolved oxygen, and biomass showing simulations (blue), mean, and the 10th and 90th percentile of the simulations (black) (not to be confused with uncertainty analysis)

Fig. 11 Elementary effects during growth phase on glucose: estimated mean and standard deviation of the distributions of elementary effects of the 11 parameters on the model outputs. The two lines drawn in each subplot correspond to \( {\text{Mean}}_{i} \; \pm 2{\text{sem}}_{i} \) (see text)

Fig. 12 Elementary effects during growth phase on ethanol: estimated mean and standard deviation of the distributions of elementary effects of the 11 parameters on the model outputs. The two lines drawn in each subplot correspond to \( {\text{Mean}}_{i} \; \pm 2{\text{sem}}_{i} \) (see text)

With regard to the model predictions for ethanol, this model output is most sensitive to the maximum glucose uptake rate and biomass yield on glucose (reduction pathway) during the first growth phase, and later on the maximum ethanol uptake rate. This is in good agreement with the fact that the production of ethanol is a result of the reduction of glucose, and its consumption only takes place during the second growth phase following the depletion of glucose. A similar pattern was observed with regard to the model predictions for oxygen.

To analyze the sensitivity of the outputs to the parameters in more detail, two time points during the exponential growth phase on glucose (t = 17 h) and on ethanol (t = 27 h) were selected. The SRC and corresponding rank position for these time points are provided in Table 9a and b, respectively. As could be expected, during the growth on glucose, the parameters that most influence the prediction of glucose are the biomass yield parameters (for the two pathways) and the maximum uptake rate. The two yield coefficients have, however, a different effect on the glucose prediction: while an increase in the oxidative yield will lead to a lower predicted concentration, an increase in the reductive yield seems to imply an increase in the predicted concentration. This may reflect the fact that the oxidative pathway is the most effective way of transforming glucose into biomass.

Table 9a Ranking and SRC value of the model parameters for each model output, for a time point during the exponential growth phase on glucose
Table 9b Ranking and SRC value of the model parameters for each model output, for a time point during the exponential growth phase on ethanol

The maximum glucose uptake rate is also the most influential parameter for the prediction of the ethanol concentration (produced by reduction of glucose) during this first growth phase. The glucose saturation constant also plays an important role, although a less significant one than the maximum uptake rate (r G,max: SRC = 0.74; K G: SRC = −0.48).

The results of the global sensitivity analysis (SRC) should be compared with those of the local sensitivity analysis (Fig. 2). Both methods rank the biomass yield on glucose (oxidation) as the most influential parameter. For the ranking of the other parameters, however, there are considerable differences between the results obtained by the two methods.

2.5.1 Morris Screening

As discussed by Sin et al. [11], Morris screening is an alternative to the linear regression method, especially when low \( R^{2} \) values are observed. As in the linear regression method, a sampling-based approach is used. The method is based on Morris sampling, which is an efficient sampling strategy for performing a randomized one-factor-at-a-time (OAT) sensitivity analysis. The parameters are assigned uniform distributions with lower and upper bounds defined by the confidence intervals for the estimated parameters and by 30 % variability for the remaining ones (as done previously for the Latin hypercube sampling). The number of repetitions (r) was set to 90, corresponding to a sampling matrix with 1,080 [90 × (11 + 1)] different parameter combinations. The model was simulated for all the parameter combinations, and the results are summarized in Fig. 10.
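The size of the sampling matrix follows from the structure of the design: each repetition is a trajectory of M + 1 points in which the parameters are changed one at a time. The sketch below generates a simplified version of such a design on the unit hypercube. The number of grid levels, the step size Δ = p/[2(p − 1)], and the restriction to increasing steps are assumptions made for illustration; they are not taken from the case study, and the function names are hypothetical.

```python
import random


def morris_trajectories(n_params, r, levels=4, seed=0):
    """Simplified Morris design on the unit hypercube.

    Each trajectory starts at a random grid point and then increases one
    parameter at a time by delta, in random order, giving n_params + 1 points.
    """
    rng = random.Random(seed)
    delta = levels / (2.0 * (levels - 1))
    # Grid values from which a step of delta stays inside [0, 1].
    starts = [k / (levels - 1) for k in range(levels)
              if k / (levels - 1) + delta <= 1.0 + 1e-12]
    design = []
    for _ in range(r):
        point = [rng.choice(starts) for _ in range(n_params)]
        trajectory = [point[:]]
        order = list(range(n_params))
        rng.shuffle(order)
        for j in order:
            point[j] += delta  # perturb exactly one parameter
            trajectory.append(point[:])
        design.append(trajectory)
    return design


def to_physical(unit_point, bounds):
    """Map a unit-hypercube point onto [low, high] for each parameter."""
    return [lo + u * (hi - lo) for u, (lo, hi) in zip(unit_point, bounds)]


design = morris_trajectories(n_params=11, r=90)
n_runs = sum(len(t) for t in design)  # 90 * (11 + 1) model evaluations
```

With 11 parameters and 90 repetitions, `n_runs` equals 1,080, in line with the number of parameter combinations quoted above. Before simulation, each point would be mapped to the actual parameter ranges with a function such as `to_physical`.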

The elementary effects (EE) were estimated as described by Sin et al. [11]. These EEs are described as random observations of a certain distribution function F, and are defined by Eq. 13, where Δ is a predetermined perturbation factor of \( \theta_{j} \), \( sy_{k}(\theta_{1}, \theta_{2}, \theta_{j}, \ldots, \theta_{M}) \) is the scalar model output evaluated at input parameters \( (\theta_{1}, \theta_{2}, \theta_{j}, \ldots, \theta_{M}) \), whereas \( sy_{k}(\theta_{1}, \theta_{2}, \theta_{j} + \Delta, \ldots, \theta_{M}) \) is the scalar model output corresponding to a Δ change in \( \theta_{j} \).

$$ \begin{aligned} EE_{jk} &= \frac{\partial sy_{k}}{\partial \theta_{j}} \\ &= \frac{sy_{k}(\theta_{1}, \theta_{2}, \theta_{j} + \Delta, \ldots, \theta_{M}) - sy_{k}(\theta_{1}, \theta_{2}, \theta_{j}, \ldots, \theta_{M})}{\Delta} \end{aligned} $$
(13)

The EEs obtained are characterized by the mean and the standard deviation of this distribution. Often, the EEs obtained for each parameter are plotted together with two lines defined by \( {\text{Mean}}_{i} \pm 2{\text{sem}}_{i} \), where \( {\text{Mean}}_{i} \) is the mean effect for output i and \( {\text{sem}}_{i} \) is the standard error of the mean \( ({\text{sem}}_{i} = {\text{std}}\;{\text{deviation}}_{i} /\sqrt r ) \). The EEs are scaled, and thus a comparison across parameters is possible.

This analysis also has to be performed for a selected time point, or using a time-series average. As the cultivation has distinct phases, several time points were selected. The results for the growth phase on glucose (t = 17.2 h) and the growth phase on ethanol (t = 27.2 h) are presented in Figs. 11 and 12, respectively.

Parameters that lie in the area in between the two curves (inside the wedge) are said to have an insignificant effect on the output, while parameters outside the wedge have a significant effect. Moreover, nonzero standard deviations indicate nonlinear effects, implying that parameters with zero standard deviation and nonzero mean have a linear effect on the outputs.
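The calculation of the elementary effects and the wedge criterion can be sketched as follows. This is an illustration under stated assumptions: it uses a made-up test function with three hypothetical parameters, a simple randomized one-factor-at-a-time design from random base points instead of full Morris trajectories, and the ±2 sem lines given in the captions of Figs. 11 and 12.

```python
import math
import random
import statistics


def elementary_effects(model, n_params, r, delta=0.1, seed=0):
    """Return ee[j], the list of r elementary effects of parameter j (Eq. 13)."""
    rng = random.Random(seed)
    ee = [[] for _ in range(n_params)]
    for _ in range(r):
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(n_params)]
        y0 = model(base)
        for j in range(n_params):
            moved = base[:]
            moved[j] += delta  # change one parameter only
            ee[j].append((model(moved) - y0) / delta)
    return ee


def summarize(effects):
    """Mean, standard deviation, sem, and position relative to the wedge."""
    r = len(effects)
    mean = statistics.fmean(effects)
    std = statistics.stdev(effects)
    sem = std / math.sqrt(r)
    return {"mean": mean, "std": std, "sem": sem,
            "significant": abs(mean) > 2.0 * sem}  # outside the wedge


def toy_model(theta):
    """Made-up function: linear in theta[0], nonlinear in theta[1], no effect of theta[2]."""
    return 2.0 * theta[0] + theta[1] ** 2 + 0.0 * theta[2]


summary = [summarize(e) for e in elementary_effects(toy_model, n_params=3, r=50)]
```

In this example the first parameter has a nonzero mean with a standard deviation of practically zero (linear effect), the second has a nonzero mean and a clearly nonzero standard deviation (nonlinear effect), and the third has a mean of zero and therefore lies inside the wedge.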

During growth on glucose (Fig. 11), only a few parameters show a significant effect on the model outputs. While Y OxidXG seems to have a nonlinear effect on the glucose prediction, r G,max has a linear one. The effects of the other parameters are mostly nonlinear, as expected given the structure of the model used in the example. Y OxidXG also has a significant effect on oxygen and biomass, while r G,max has a significant effect on ethanol.

With regard to the results for a time point during growth on ethanol, it is important to note that Y OxidXG appears to have a significant effect on the ethanol, oxygen, and biomass predictions, although the glucose has been depleted. This may reflect the impact of the biomass concentration (formed during the prior growth on glucose) on the total amount of ethanol produced, as well as on its consumption and the consumption of oxygen at the observed time point.

There is good agreement of the results of the Morris analysis with the previously presented SRC ranking obtained for the linear regression method. In Figs. 11 and 12, the parameters most distant from the wedge are the parameters ranked as the most influential on the model outputs (Table 9a, b).

3 Discussion

A mechanistic model of glucose oxidation by Saccharomyces cerevisiae has been taken as an example and has been analyzed rigorously with a number of methods. The chosen case study was purposely kept relatively simple in order to better illuminate how the different methods work and what kind of information is gained in each step. The presented analysis methods are generic, however, and can be applied to a wide range of process models to assess their reliability. Each step of the analysis has already been commented on in detail. However, one thing that cannot be emphasized enough is the importance of collecting proper datasets: biological replicates (duplicate/triplicate fermentations) as well as sample replicates are needed to know the error of the measurements. If the quality of the collected data is not sufficiently high, this might later raise severe questions about the reliability of the resulting model.

Assuming that a decision has been taken to develop a mechanistic model of a pharmaceutical production process, or one of its unit operations, one could, of course, wonder how such a model can be established, and how it can support PAT objectives. In general, construction of a mechanistic model is considered time-consuming, which may explain why data-driven models and chemometrics have been more popular than mechanistic approaches, despite the PAT guidance. However, during the past 5 years, this situation has already changed considerably for small-molecule drug substances [3]. In our view, the tools presented here can be helpful in setting up and structuring the model equations in an efficient way, for example, by making use of matrix notation, which can facilitate transfer of the model equations between different users. Such sharing of modeling knowledge is essential in multidisciplinary process development. As discussed by Sin et al. [14], a significant part of such a model matrix can be transferred from one system to a second or a third, which undoubtedly makes the whole model-building exercise more efficient.

Finally, we would also like to emphasize that one should move ahead in small steps when constructing a mechanistic model of a process or unit operation. One should rather start with a smaller model with limited scope, for example, an unstructured model [21]. Such a model could then be gradually extended with more detail, while the development of the production process at laboratory and pilot scale is ongoing. The model analysis tools presented here can then be used in the different stages of the model-building as continuous quality checks of the model.

Once a model is considered ready for use, a first application that is relevant for such a model is to use simulations to propose more informative experiments leading to more accurate estimation of the model parameters, for example, by applying optimal experimental design (OED) [22]. Furthermore, the mechanistic model can be helpful in process design, optimization, and in development of suitable control strategies [23]. The latter applications of the model are essential for implementing PAT principles, and can potentially contribute to more efficient process development, replacing data collection and experiments by simulations whenever possible.

4 Conclusions

Mechanistic models form an attractive alternative for structuring and representing process knowledge, also for production processes in biotechnology. The reliability of such models can be confirmed by performing identifiability, uncertainty, and sensitivity analyses on the resulting model. Tools for performing such analyses can be considered as standard engineering tools and are increasingly available on different software platforms. Once it can be documented that the model is reliable, it can be used for design of experiments, for process optimization and design, and for investigating the usefulness of novel control strategies.