5.1 Constrained Global Optimization

GO and BO were first considered in “essentially unconstrained” conditions, where the solution is searched for within a bounded-box search space. Recently, for both methodological and application reasons, there has been an increasing interest in constrained global optimization (CGO):

$$\begin{aligned} & x^{*} = \mathop {\arg \hbox{min} }\limits_{x \in X} f\left( x \right) \\ & {\text{Subject}}\,{\text{to}} \\ & c_{i} \left( x \right) \le 0,\quad i = 1, \ldots ,n_{c} \\ \end{aligned}$$

A paper of general interest about a taxonomy of constraints in simulation-based optimization is given in Le Digabel (2015). Some general remarks have to be taken into account:

  1. Even if the constraints are analytically defined, their presence without the convexity assumption raises a whole new set of challenging topics, in particular the interaction between the feasible region and the surrogate probabilistic model.

  2. Since BO assumes the objective function to be black box, a natural extension is to consider the constraints as black box as well (unknown constraints).

  3. A relevant case is when the objective function is undefined, and therefore cannot be computed, outside the feasible region. In this case, we speak about partially defined objective functions (Rudenko 1994; Sergeyev et al. 2007).

Point three is particularly relevant when the evaluation of the objective requires, as in black-box conditions, the execution of a simulation model which can return a valid output or a failure message. An example of failure can be a computational fluid dynamics solver that does not converge due to instability of the numerical scheme (Sacher et al. 2018) or a hydraulic simulator whose input generates pressures or flows that are physically impossible (Tsai et al. 2018). Also frequent is the case of a complex machine learning model whose training, and therefore the evaluation of the loss function, relies on (stochastic) gradient descent, which may fail to converge. Such an event prevents the exploration of a neighbourhood of a “not computable” point and halts the sequential optimization procedure. A naïve solution, in the presence of a not computable point, is to associate to it a fixed high (low) penalty value for the objective function to be minimized (maximized): still, determining a suitable value is not a trivial task and might anyway imply a loss of accuracy in the Gaussian surrogate model and a loss of sample efficiency.

There have been several attempts at leveraging the BO framework to deal with constrained optimization: the main problem is to propose a suitable acquisition function for constrained BO (CBO).

The use of GP- and EI-based heuristics was first proposed in Jones et al. (1998), allowing the \(c_{i} \left( x \right)\) to be black box and assuming their mutual independence as well as their independence from the objective function. A GP prior is placed on each constraint. If \(f_{c}^{ + }\) is the best feasible observation of \(f\), the EI acquisition function is:

$${\text{EI}}\left( {x|f_{c}^{ + } } \right) = \left\{ {\begin{array}{*{20}c} {\left( {\mu \left( x \right) - f_{c}^{ + } } \right){\varvec{\Phi}}\left( Z \right) + \sigma \left( x \right)\phi \left( Z \right) \;{\text{if}} \;\sigma \left( x \right) > 0} \\ {0\;{\text{if}}\;\sigma \left( x \right) = 0} \\ \end{array} } \right.$$

where \(\phi\) and \({\varvec{\Phi}}\) are the probability density and the cumulative distribution functions of the standard normal distribution, respectively, and

$$Z = \left\{ {\begin{array}{*{20}c} {\frac{{\mu \left( x \right) - f_{c}^{ + } }}{\sigma \left( x \right)}\,{\text{if}}\,\sigma \left( x \right) > 0} \\ {0\,{\text{if}}\,\sigma \left( x \right) = 0} \\ \end{array} } \right.$$

In the presence of constraints the formula becomes:

$${\text{EIC}}\left( {x|f_{c}^{ + } } \right) = {\text{EI}}(x|f_{c}^{ + } )\mathop \prod \limits_{i = 1}^{{n_{c} }} {\mathbb{P}}\left( {c_{i} \left( x \right) \le 0} \right)$$

where the improvement of a candidate solution \(x\) over \(f\) is zero if \(x\) is not feasible. When noise in the constraints is taken into account, we may not know which observations are feasible: for instance, the best GP mean value satisfying each constraint \(c_{i} \left( x \right)\) with probability at least \(1 - \delta_{i}\) can be used (Letham et al. 2019). Gramacy (2016) proposes a different approach for handling constraints, in which they are brought into the objective function via a Lagrangian. EI is no longer tractable analytically but can be evaluated numerically via Monte Carlo integration or quadrature (Picheny 2016). Relevant prior results on BO with unknown constraints proposed new acquisition functions, such as the integrated expected conditioned improvement (IECI) (Gramacy and Lee 2011). Another approach is presented in Feliot et al. (2017), where an adaptive random search is used to approximate the feasible region while optimizing the objective function. A penalty approach was first considered in Gardner et al. (2014), where a penalty is assigned directly to the acquisition function in case of infeasibility, with the aim of moving away from infeasible regions. A similar approach has been extended in Candelieri et al. (2018) to the case in which the function is partially defined: infeasibility is treated by assigning a fixed penalty as value of the objective function (we refer to this as “BO with penalty”).

A new general approach is offered by information-based methods, which have been extended to the constrained case (e.g. predictive entropy search with constraints, PESC) in Hernandez-Lobato et al. (2015): the code for PESC is included in Spearmint and available at https://github.com/HIPS/Spearmint/tree/PESC.

The above approaches assume that the number of constraints is known a priori and that the constraints are statistically independent.

The assumption of independence permits computing the probability of feasibility simply as the product of the individual probabilities with respect to each constraint. The result is multiplied by the acquisition function, whose optimization then prefers points satisfying the constraints with high probability.
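To make this computation concrete, the following Python sketch (our illustrative code, not taken from the cited papers) uses one `GaussianProcessRegressor` for the objective and one per constraint; EI is written for the minimization problem stated at the beginning of the section, so the improvement is \(f_{c}^{ + } - \mu \left( x \right)\). All function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp_f, X, f_best):
    """EI for minimization: (f_best - mu) * Phi(Z) + sigma * phi(Z), 0 where sigma = 0."""
    mu, sigma = gp_f.predict(X, return_std=True)
    z = np.where(sigma > 0, (f_best - mu) / np.maximum(sigma, 1e-12), 0.0)
    ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)

def eic(gp_f, gp_constraints, X, f_best):
    """EIC(x) = EI(x) * prod_i P(c_i(x) <= 0), assuming independent constraint GPs."""
    acq = expected_improvement(gp_f, X, f_best)
    for gp_c in gp_constraints:  # one GP per black-box constraint
        mu_c, sigma_c = gp_c.predict(X, return_std=True)
        acq *= norm.cdf(-mu_c / np.maximum(sigma_c, 1e-12))  # P(c_i(x) <= 0)
    return acq

# Usage sketch: gp_f = GaussianProcessRegressor().fit(X_seen, y_seen), one fitted
# GP per constraint in gp_constraints, and X an (n_candidates, d) array.
```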

The issue of a partially defined function, or equivalently “crash constraints” or “non-computable domains”, transforms the GP-based feasibility evaluation from a regression problem into a classification one. This classification approach has been analysed in several papers and is discussed in the following.

In Basudhar et al. (2012) a probabilistic SVM (PSVM) is used to calculate the so-called probability of feasibility, and the optimization scheme alternates between a global search for the optimal solution, depending on both this probability and the estimated value of the objective function (modelled through a GP), and a local refinement of the PSVM through an adaptive local sampling scheme. The most common PSVM model is based on the sigmoid function (Vapnik 1998). For a given sample \(x\), the probability of belonging to the +1 class (i.e. “feasible”) is:

$$P\left( { + 1 |x} \right) = \frac{1}{{1 + {\text{e}}^{As\left( x \right) + B} }}$$

where \(s\left( x \right)\) is the value of the SVM decision function at \(x\). The parameters A (A < 0) and B of the sigmoid function are found by maximum likelihood. Two alternative formulations of the acquisition function are proposed:

$$\mathop {\hbox{max} }\limits_{x} {\text{EI}}\left( x \right)P\left( { + 1 |x} \right)$$

and

$$\begin{aligned} & \mathop {\hbox{max} }\limits_{x} {\text{EI}}\left( x \right) \\ & {\text{Subject}}\,{\text{to}} \\ & P\left( { + 1 |x} \right) \ge 0.5 \\ \end{aligned}$$
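As an illustration (a sketch under our own naming, not the authors' code), the first formulation can be implemented with scikit-learn, whose `SVC(probability=True)` fits exactly the Platt sigmoid above; `expected_improvement` is the helper from the previous sketch.

```python
import numpy as np
from sklearn.svm import SVC

def psvm_acquisition(gp_f, svm_feas, X_cand, f_best):
    """First formulation: max_x EI(x) * P(+1 | x) over a pool of candidate points."""
    ei = expected_improvement(gp_f, X_cand, f_best)
    p_feas = svm_feas.predict_proba(X_cand)[:, 1]  # Platt-scaled P(+1 | x)
    return X_cand[np.argmax(ei * p_feas)]

# Usage sketch: svm_feas = SVC(kernel="rbf", probability=True).fit(X_seen, y_feas)
# with y_feas in {-1, +1}. The second formulation corresponds to maximizing EI
# only over the candidates with p_feas >= 0.5.
```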

Other approaches, also based on GPs and focused on cases where the objective function cannot be computed (non-computable domains or crash constraints), have been proposed in Sacher et al. (2018) and Bachoc et al. (2019).

The previous approaches use independent GPs to model the objective function and the constraints, requiring two strong assumptions: a priori knowledge of the number of constraints and independence among the objective function and all the constraints. Extending EI to account for correlations between constraints and the objective function is still, according to Letham et al. (2019), an open challenge.

To overcome this limitation, a new approach, namely SVM-CBO (Support Vector Machine based Constrained BO), has been proposed in Candelieri and Archetti (2019): the main contribution of the paper is the development of a method which does not require either of the two previous assumptions. The approach uses a support vector machine (SVM) to sequentially estimate and model the unknown feasible region \(\Omega\) within the search space (i.e. feasibility determination), without any assumption on the number of constraints or on their independence.

SVM-CBO is organized in two phases: the first aims to provide a first estimate of \(\Omega\) (feasibility determination), and the second is BO performed on such an estimate only. The motivation is that we are interested in obtaining a good approximation of the overall feasible region, and not only of its portion close to the optimal solution (i.e. feasibility determination is a goal per se, in our approach). Another relevant difference with Basudhar et al. (2012) is that SVM-CBO uses the available “budget” (i.e. the maximum number of function evaluations) more efficiently: at every iteration of both phase 1 and phase 2, we perform just one function evaluation, while in the boundary refinement of Basudhar et al. (2012) a given number \(n_{p}\) of function evaluations is performed in the neighbourhood of the next point to evaluate, with the aim of locally refining the boundary estimate. SVM-CBO is detailed in the following section.

5.2 Support Vector Machine—Constrained Bayesian Optimization

We start with the definition of the problem, that is:

$$\mathop {\hbox{min} }\limits_{{x \in \varOmega \subset {\rm X} \subset {\mathbb{R}}^{d} }} f\left( x \right)$$

where \(f\left( x \right)\) has the following properties: it is black box, multi-extremal, expensive and partially defined. The last property means that \(f\left( x \right)\) is undefined outside the feasible region \(\Omega\), which is a subset of the overall bounded-box search space \(X \subset {\mathbb{R}}^{d}\). Moreover, we consider the case in which the constraints defining the feasible region are also black box.

We introduce some notation that will be used in the following:

  • \(D_{n}^{\varOmega } = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}_{i = 1,..,n}\) is the feasibility determination dataset;

  • \(D_{l}^{f} = \left\{ {\left( {x_{i} ,f\left( {x_{i} } \right)} \right)} \right\}_{i = 1,..,l}\) is the function evaluations dataset, with \(l \le n\) (because \(f\left( x \right)\) is partially defined on \(X\)) and where \(l\) is the number of points where it was possible to compute \(f\) out of the \(n\) queried so far;

where \(x_{i}\) is the i-th queried point and \(y_{i} \in \left\{ { + 1, - 1} \right\}\) indicates whether \(x_{i}\) is feasible or infeasible, respectively.

  • Phase 1: Feasibility determination

The first phase of the approach aims to find an estimate \({\tilde{\varOmega }}\) of the actual feasible region \(\Omega\) in M function evaluations (\({\tilde{\varOmega }}_{M} = {\tilde{\varOmega }}\)). The sequence of function evaluations is determined according to an SMBO process where the surrogate model—of the feasible region, in this phase—provides the currently estimated feasible region \({\tilde{\varOmega }}_{n}\). As surrogate model we use the (non-linear) separation hyperplane of an SVM classifier, trained on the set \(D_{n}^{\Omega }\). The SVM classifier uses an RBF kernel to model feasible regions with non-linear boundaries.

Let us denote by \(h_{n} \left( x \right)\) the argument of the SVM-based classification function:

$$h_{n} \left( x \right) = \mathop \sum \limits_{i = 1}^{{n_{SV} }} \alpha_{i} y_{i} k\left( {\bar{x}_{i} ,x} \right) + b$$

where \(\alpha_{i}\) and \(y_{i}\) are the Lagrangian coefficient and the “feasibility label” of the i-th support vector, \(\bar{x}_{i}\), respectively, \(k\left( {.,.} \right)\) is the kernel function (i.e. an RBF kernel, in this study), \(b\) is the offset and \(n_{\text{SV}}\) is the number of support vectors.

The boundaries of the estimated feasible region \({\tilde{\varOmega }}_{n}\) are given by \(h_{n} \left( x \right) = 0\) (i.e. non-linear separation hyperplane). The SVM-based classification function provides the estimated feasibility for any \(x \in {\rm X}\):

$$\tilde{y} = {\text{sign}}\left( {h_{n} \left( x \right)} \right) = \left\{ {\begin{array}{*{20}c} { + 1\,{\text{if}}\,x \in {\tilde{\varOmega }}_{n} } \\ { - 1\,{\text{if}}\,x \notin {\tilde{\varOmega }}_{n} } \\ \end{array} } \right.$$

With respect to the aim of the first phase, we propose a “feasibility acquisition function” aimed at identifying the next promising point according to two different goals:

  • Improving the estimate of feasible region

  • Discovering possible disconnected feasible regions.

To deal with the first goal, we use the distance of x from the boundaries of the currently estimated feasible region \({\tilde{\varOmega }}_{n}\), using the following formula from the SVM classification theory:

$$d_{n} \left( {h_{n} \left( x \right),x} \right) = \left| {h_{n} \left( x \right)} \right| = \left| {\mathop \sum \limits_{i = 1}^{{n_{\text{SV}} }} \alpha_{i} y_{i} k\left( {\bar{x}_{i} ,x} \right) + b} \right|$$

To deal with the second goal, we introduce the concept of “coverage of the search space”, defined by:

$$c_{n} \left( x \right) = \sum\limits_{i = 1}^{n} {{\text{e}}^{{ - \frac{{\left\| {x_{i} - x} \right\|^{2} }}{{2\sigma_{c}^{2} }}}} }$$

So, \(c_{n} \left( x \right)\) is a sum of n RBF functions centred on the points evaluated so far, with \(\sigma_{c}\) a parameter to set the width of the corresponding bell-shaped curve.

Finally, the feasibility acquisition function is given by the sum of \(d_{n} \left( {h_{n} \left( x \right),x} \right)\) and \(c_{n} \left( x \right)\), and the next promising point is identified by solving the following optimization problem:

$$x_{n + 1} = \mathop {\text{argmin}}\limits_{{x \in \,{\text{X}}}} \left\{ {d_{n} \left( {h_{n} \left( x \right),x} \right) + c_{n} \left( x \right)} \right\}$$

Thus, we want to select the point associated with the minimal distance from the boundaries of the currently estimated feasible region and the minimal coverage (i.e. maximal uncertainty). This allows us to balance between improving the estimate of the feasible region and discovering possibly disconnected feasible regions (in less explored areas of the search space). It is important to highlight that, in phase 1, the optimization is performed on the overall bounded-box search space \({\rm X}\).
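A minimal sketch of this phase-1 step follows (illustrative code; the value of \(\sigma_{c}\), the candidate-pool size and the random search over \({\rm X}\) are our choices, not prescribed by the authors):

```python
import numpy as np
from sklearn.svm import SVC

def next_phase1_point(svm, X_seen, bounds, sigma_c=0.1, n_cand=10_000, rng=None):
    """Minimize d_n(h_n(x), x) + c_n(x) over the whole box X via random candidates."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds[:, 0], bounds[:, 1]
    cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))
    dist = np.abs(svm.decision_function(cand))  # |h_n(x)|
    sq_norms = ((cand[:, None, :] - X_seen[None, :, :]) ** 2).sum(axis=-1)
    coverage = np.exp(-sq_norms / (2.0 * sigma_c ** 2)).sum(axis=1)  # c_n(x)
    return cand[np.argmin(dist + coverage)]

# Usage sketch: svm = SVC(kernel="rbf").fit(X_seen, y_feas); bounds is a (d, 2) array.
```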

After the function evaluation of the new point \(x_{n + 1}\), the following information is available:

$$y_{n + 1} = \left\{ {\begin{array}{*{20}c} { + 1\;{\text{if}}\;x_{n + 1} \in \varOmega :\;f\left( {x_{n + 1} } \right)\;{\text{is}}\;{\text{defined}}} \\ { - 1\;{\text{if}}\;x_{n + 1} \notin \varOmega :\;f\left( {x_{n + 1} } \right)\;{\text{is}}\;{\text{not}}\;{\text{defined}}} \\ \end{array} } \right.$$

and the following updates are performed:

  • Feasibility determination dataset and estimated feasible region \({\tilde{\varOmega }}_{n + 1}\)

$$\begin{array}{*{20}c} {D_{n + 1}^{\varOmega } = D_{n}^{\varOmega } \cup \left\{ {\left( {x_{n + 1} ,y_{n + 1} } \right)} \right\}} \\ {h_{n + 1} \left( x \right) |D_{n + 1}^{\varOmega } } \\ {n \leftarrow n + 1} \\ \end{array}$$
  • Only if \(x_{n + 1} \in \varOmega\), function evaluations dataset

$$\begin{array}{*{20}c} {D_{l + 1}^{f} = D_{l}^{f} \cup \left\{ {\left( {x_{l + 1} , f\left( {x_{l + 1} } \right)} \right)} \right\}} \\ {l \leftarrow l + 1} \\ \end{array}$$

The SMBO process for phase 1 is repeated until \(n = M\).

  • Phase 2: Bayesian optimization in the estimated feasible region

In this phase, a traditional BO process is performed but with the following relevant differences:

  • the search space is not box-bounded but the estimated feasible region \({\tilde{\varOmega }}_{n}\) identified in phase 1

  • the surrogate model—a GP—is fitted only using the feasible solutions observed so far, \(D_{l}^{f}\)

  • the acquisition function for phase 2—lower confidence bound (LCB), in this study—is defined on \({\tilde{\varOmega }}_{n}\), only

Thus, the next point to evaluate is given by:

$$x_{n + 1} = \mathop {\text{argmin}}\limits_{{x \in {\tilde{\varOmega }}_{n} }} \left\{ {{\text{LCB}}_{n} \left( x \right) = \mu_{n} \left( x \right) - \beta_{n} \sigma_{n} \left( x \right)} \right\}$$

where \(\mu_{n} \left( x \right)\) and \(\sigma_{n} \left( x \right)\) are the mean and the standard deviation of the current GP-based surrogate model and \(\beta_{n}\) is a parameter managing the trade-off between exploration and exploitation in this phase. It is important to highlight that, contrary to phase 1, the acquisition function is here minimized on \({\tilde{\varOmega }}_{n}\) only, instead of on the entire bounded-box search domain \({\rm X}\).
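A matching sketch of the phase-2 step (again illustrative; the fixed \(\beta\) and the filtering of random candidates through the SVM are our simplifications):

```python
import numpy as np

def next_phase2_point(gp_f, svm, bounds, beta=2.0, n_cand=10_000, rng=None):
    """Minimize LCB_n(x) = mu_n(x) - beta_n * sigma_n(x) over the estimated region."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds[:, 0], bounds[:, 1]
    cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))
    cand = cand[svm.predict(cand) == 1]  # keep only x in the estimated feasible region
    mu, sigma = gp_f.predict(cand, return_std=True)
    return cand[np.argmin(mu - beta * sigma)]
```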

The point \(x_{n + 1}\) is just expected to be feasible, according to \({\tilde{\varOmega }}_{n}\) but the information on its actual feasibility is known only after having checked whether \(f\left( {x_{n + 1} } \right)\) can or cannot be computed (i.e. it is defined or not in \(x_{n + 1}\)). Subsequently, the feasibility determination dataset is updated as follows:

$$D_{n + 1}^{\varOmega } = D_{n}^{\varOmega } \cup \left\{ {\left( {x_{n + 1} ,y_{n + 1} } \right)} \right\}$$

and according to the two alternative cases:

  • \(x_{n + 1}\) is actually feasible: \(x_{n + 1} \in\Omega\), \(y_{n + 1} = + 1;\)

the function evaluations dataset is updated as follows: \(D_{l + 1}^{f} = D_{l}^{f} \cup \left\{ {\left( {x_{l + 1} ,f\left( {x_{l + 1} } \right)} \right)} \right\}\), where \(l \le n\) is the number of feasible solutions among all the points observed so far. The current estimated feasible region \({\tilde{\varOmega }}_{n}\) can be considered accurate and retraining of the SVM classifier can be avoided: \({\tilde{\varOmega }}_{n + 1} = {\tilde{\varOmega }}_{n}\)

  • \(x_{n + 1}\) is actually infeasible: \(x_{n + 1} \notin\Omega\), \(y_{n + 1} = - 1;\)

the estimated feasible region must be updated to reduce the risk of further infeasible evaluations

$$h_{n + 1} \left( x \right) | D_{n + 1}^{\varOmega } \Rightarrow {\tilde{\varOmega }}_{n + 1}$$

Phase 2 continues until the overall available budget \(n = N\) is reached.

In Candelieri (2019) the SVM-CBO approach has been validated on five 2D test functions for CGO and compared to “BO with penalty”. An overall budget of 100 function evaluations was used, divided as follows:

  • 10 for initialization through Latin hypercube sampling (LHS);

  • 60 for feasibility estimation (phase 1);

  • and 30 for SMBO constrained to the estimated feasible region (phase 2).

In the case of BO with penalty, the same budget has been divided as follows:

  • 10 evaluations for initialization (LHS)

  • and 90 for BO (that is the sum of budget used for phase 1 and phase 2 in the proposed approach).

For each independent run, the initial set of solutions identified through LHS is the same for SVM-CBO and BO with penalty, in order to avoid differences in the values of the gap metric due to different initialization.

The so-called gap metric has been used to measure the improvement obtained along the SMBO process with respect to the global optimum \(f\left( {x^{*} } \right)\) and the initial best solution \(f\left( {x_{0} } \right)\) obtained from the initialization step:

$$G_{n} = \frac{{\left| {f\left( {x_{0} } \right) - f\left( {x^{ + } } \right)} \right|}}{{\left| {f\left( {x_{0} } \right) - f\left( {x^{*} } \right)} \right|}}$$

where \(f\left( {x^{ + } } \right)\) is the “best seen” up to iteration \(n\). The gap metric varies in the range [0, 1]. For statistical significance, the gap metric has been computed on 30 different runs, performed for every test function and for both SVM-CBO and BO with penalty.
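For clarity, the metric is trivially computed as follows (a one-line helper under our own naming):

```python
def gap_metric(f_x0, f_best_seen, f_opt):
    """G_n = |f(x_0) - f(x+)| / |f(x_0) - f(x*)|, in [0, 1]."""
    return abs(f_x0 - f_best_seen) / abs(f_x0 - f_opt)
```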

While the gap metric allows for comparing SVM-CBO to BO with penalty, it was important to introduce another performance measure to quantify how good the approximation of the feasible region is along the SMBO process. We have defined a simple overlap metric as follows:

$$O_{n} \left( {\Omega ,{\tilde{\varOmega }}_{n} } \right) = \left\{ {\begin{array}{*{20}l} {\frac{{{\text{Vol}}\left( {{\tilde{\varOmega }}_{n} \cap \Omega } \right)}}{{{\text{Vol}}\left( \Omega \right)}}} & {{\text{if}}\;{\text{Vol}}\left( {{\tilde{\varOmega }}_{n} } \right) < {\text{Vol}}\left( \Omega \right) \vee \left( {{\text{Vol}}\left( {{\tilde{\varOmega }}_{n} \cap \Omega } \right) = {\text{Vol}}\left( \Omega \right) \wedge {\text{Vol}}\left( {{\tilde{\varOmega }}_{n} } \right) = {\text{Vol}}\left( \Omega \right)} \right)} \\ {\frac{{{\text{Vol}}\left( {{\tilde{\varOmega }}_{n} } \right)}}{{{\text{Vol}}\left( \Omega \right)}}} & {\text{otherwise}} \\ \end{array} } \right.$$

where \(\Omega\) and \({\tilde{\varOmega }}_{n}\) are the actual and the estimated feasible regions, respectively, and \({\text{Vol}}()\) is the volume of a region. More precisely, the volume of a region is computed as an approximation by simply generating a grid of points within the search space and then counting the number of points falling into that region.

According to its definition, the overlap metric can vary in the range \(\left[ {0,\infty } \right)\), where a good approximation of the feasible region is associated with a value equal to 1 while no overlap is associated with a value equal to 0. Values higher than 1 represent situations where \(\Omega \subset {\tilde{\varOmega }}_{n}\).
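A sketch of this grid-based approximation follows (our code; for the synthetic test functions the true membership in \(\Omega\) is computable, so both masks can be evaluated on the same grid):

```python
import numpy as np

def overlap_metric(in_true, in_est):
    """in_true, in_est: boolean masks over the same grid of points, marking
    membership in the actual Omega and in the estimated Omega_n, respectively."""
    vol_true, vol_est = in_true.sum(), in_est.sum()  # volumes ~ grid point counts
    vol_inter = (in_true & in_est).sum()
    if vol_est < vol_true or (vol_inter == vol_true and vol_est == vol_true):
        return vol_inter / vol_true
    return vol_est / vol_true  # > 1 when the estimate over-covers Omega
```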

The following set of figures shows the gap metric computed for every test function with respect to the number of function evaluations, excluding the initialization through LHS. The value of the gap metric at iteration 0 is the best seen at the end of the initialization step (i.e. \(f\left( {x_{0} } \right)\) in the gap metric formula). Each graph compares the gap metric provided by SVM-CBO and BO with penalty, respectively. Both the average and the standard deviation of the gap metric, computed on 30 independent runs for each approach, are depicted. The higher effectiveness of the proposed approach is clear, even considering the variance in the performances. The end of phase 1 is represented by a vertical dotted line in the charts; a significant improvement of the SVM-CBO gap metric is observed after this phase in every test case. It is important to recall that phase 1 of SVM-CBO is aimed at approximating the unknown feasible region, while the optimization process only starts with phase 2. Thus, the relevant shift in the gap metric is explained by the explicit model of the feasible region learned in phase 1 (Fig. 5.1).

Fig. 5.1

Comparison of gap metrics for the proposed SVM-CBO approach versus BO with penalty, for each one of the five test functions, respectively

Finally, the following set of figures shows how the overlap metric changes over the SMBO process. Since the 10 initial function evaluations are all used to provide a first estimate of the feasible region, the x-axis is limited to 90, that is, the portion of the budget allocated to phase 1 and phase 2: during these phases, particularly during phase 1, the estimate of the feasible region is updated and the overlap metric changes consequently. The dotted line represents the mean overlap metric and the shaded area represents the standard deviation. The figures show that the proposed SVM-CBO approach achieves a reasonable overlap (higher than 77%) between the estimated and the actual feasible region already after 30–35 function evaluations. The most difficult case was the one with a feasible region consisting of two disconnected ellipses but, before the end of the budget, the overlap metric anyway reached the value of 80% (Fig. 5.2).

Fig. 5.2

Overlap metric measuring the quality of feasibility estimation provided by the proposed approach, for each one of the five test functions, respectively

SVM-CBO offers a good estimate of the feasible search space, even for complex feasible search spaces consisting of disconnected regions. Only seldom did the BO process in phase 2 suggest infeasible points; thanks to the update of the estimated feasible region, it was able to quickly return close to the optimum without wasting further function evaluations outside the actual feasible region. Other important considerations, relative to a significant reduction in computational costs, are reported in Candelieri (2019).

In the following, some explanatory pictures of how SVM-CBO works are reported, considering a 2D test function (i.e. Branin) where the unknown feasible region consists of the inner areas of two disconnected ellipses.

Beginning of phase 1: first estimate of the feasible region according to the 10 function evaluations of the initialization (through LHS). The following symbols are used:

  • Solid black line: boundaries of the feasible region (feasible points are inside the ellipses)

  • Dotted blue line: boundaries of the estimated feasible region

  • Green points: optima of (original) Branin’s test function; only one of them is feasible!

  • Blue/black points: feasible/infeasible evaluations

  • Points within diamonds—black or blue: support vectors of the two classes

  • Red-circled points—black or blue: classification errors with respect to the two classes

  • Red asterisk: next point to evaluate.

figure a

Next, the modifications of the estimated feasible region along some iterations (i.e. 1, 5, 9 and 10) of phase 1:

figure b

End of Phase 1 and beginning of Phase 2: level sets for the original box-constrained objective function are also depicted. In the following figure, the estimated and actual feasible regions are reported.

figure c

End of Phase 2 (10 further function evaluations): blue triangles are the function evaluations performed in phase 2. All the triangles lie within the estimate of the feasible region obtained at the end of phase 1. This means that the estimated feasible region has not been updated during phase 2. The feasible global optimum has been identified in phase 2.

figure d

SVM-CBO has some limitations which can be relevant in some application contexts. First, as any other constrained global optimization approach, it is not well suited for problems where infeasible evaluations are “destructive”. This means that it cannot be applied, as is, to solve optimal (online) control of complex industrial systems. This specific kind of problem, known as safe exploration–optimization, is addressed in the following section. Simulation–optimization is therefore the setting which might largely benefit from SVM-CBO, as demonstrated in the case of pump scheduling optimization in a water distribution network using the hydraulic simulation software EPANET 2.0. Secondly, since the SVM classifier models the overall boundary of the feasible region, instead of each individual constraint, a sensitivity analysis cannot be performed. So, the advantage (not only computational) of using one SVM could be partially offset in case we are facing a problem requiring sensitivity analysis.

5.3 Safe Bayesian Optimization

The important difference with respect to CGO is that safe optimization (e.g. SafeOpt) does not allow, at least with a given probability, any function evaluation outside the feasible region. Moreover, in SafeOpt the objective function is not necessarily partially defined: indeed, a function evaluation is unsafe if the corresponding value of the objective function violates the safety threshold.

It is important to highlight that, contrary to the rest of this book, where the global optimization problem is defined as a minimization problem, for safe optimization we consider a maximization problem, to be coherent with the literature on this topic. Thus, we want to maximize some black-box expensive function without performing function evaluations below a given value.

An approach is presented in Sui et al. (2015), which optimizes while taking into account the (supposedly known) Lipschitz constant of the objective function. An extension of this approach is presented in Berkenkamp et al. (2016), where the SafeOpt algorithm is applied to high-dimensional problems without using the information on the Lipschitz constant.

The basic problem was stated as: \(\mathop {\hbox{max} }\limits_{x \in X} f\left( x \right)\), subject to the safety constraint \(f\left( x \right) \ge h,\) where \(h\) is a safety threshold.

A new algorithm, namely StageOpt, has been proposed in Sui et al. (2018), where the SafeOpt algorithm has been generalized and extended to be more efficient and applicable to a broader class of problems. The goal is to optimize a black-box objective function \(f:D \to {\mathbb{R}}\) from noisy evaluations at the sample points \(x_{1} ,x_{2} , \ldots ,x_{n} \in X\). Any point in the sequence, when sampled, must be “safe”, meaning that for each one of \(m\) unknown safety functions \(g_{i} \left( x \right):X \to {\mathbb{R}}\) its value lies above some safety threshold \(h_{i} \in {\mathbb{R}}\).

This optimization problem is formulated as \(\mathop {\hbox{max} }\limits_{x \in X} f\left( x \right)\) subject to the safety constraints \(g_{i} \left( x \right) \ge h_{i} \,{\text{for}}\,i = 1, \ldots ,m.\)

This is a more general formalization of the original SafeOpt (Sui et al. 2015), which used only a simple threshold on the value of the objective function, leading to a single constraint \(g\left( x \right) = f\left( x \right) \ge h\).

As in constrained optimization, GPs are used to model both the objective function and the constraints. An important assumption in StageOpt is that each safety function \(g_{i}\) is \(L_{i}\)-Lipschitz continuous, with respect to some metric on \(X\). This assumption is usually satisfied by the most commonly used kernels (Srinivas et al. 2010; Sui et al. 2015). Additionally, at least one initial “seed” set of safe points must be given, denoted as \(S_{0} \subset X\).

Since safe optimization works by safely expanding \(S_{0}\), it is not guaranteed to identify the global optimizer \(x^{ *}\), specifically in the case that the region around \(x^{ *}\) is disconnected from \(S_{0}\). The key component of SafeOpt is the one-step reachability operator:

$$R_{\varepsilon } \left( S \right)\text{ := }S \cup \mathop {\bigcap }\limits_{i} \left\{ {x \in X| \exists x^{{\prime }} \in S,g_{i} \left( {x^{{\prime }} } \right) - \varepsilon - L_{i} d\left( {x^{{\prime }} ,x} \right) \ge h_{i} } \right\}$$

where \(S\) is the current set of safe points (initially \(S = S_{0}\)), \(d\) is a distance measure in \({\mathbb{R}}^{d}\) and \(\varepsilon\) is the absolute error in considering \(x\) as safe. This operator identifies the set of all the points estimated as safe given the previous evaluations of \(f\) on \(S\). Then, given the maximum number of function evaluations \(N\), we can define the subset of \(X\) reachable after \(N\) iterations from the initial safe seed set \(S_{0}\) as \(R_{\varepsilon }^{N} \left( {S_{0} } \right)\text{ := }R_{\varepsilon } \left( {R_{\varepsilon } \ldots \left( {R_{\varepsilon } \left( {S_{0} } \right)} \right) \ldots } \right)\), \(N\) times. Thus, the optimization problem becomes:

$$x^{*} = {\text{argmax}}_{{x \in R_{\varepsilon }^{N} \left( {S_{0} } \right)}} f\left( x \right).$$

StageOpt separates the safe optimization problem into two stages: an exploration phase in which the safe region is iteratively expanded, followed by an optimization phase in which Bayesian optimization is applied within the safe region. Similarly to SVM-CBO, presented for constrained BO, the overall number of function evaluations \(N\) is divided into \(N_{1}\) for phase 1 and \(N_{2}\) for phase 2, with \(N = N_{1} + N_{2}\). To map the uncertainty of the GP model at the generic iteration \(n\), StageOpt uses the confidence intervals:

$$Q_{n}^{i} \left( x \right)\text{ := }\left[ {\mu_{n}^{i} \left( x \right) \pm \beta_{n} \sigma_{n}^{i} \left( x \right)} \right]$$

where \(\beta_{n}\) defines the level of confidence. It is easy to note that they correspond to the upper and lower confidence bounds of the GP. In the formula, superscripts index the corresponding safety functions, while subscripts index iterations, as usual. Then, to guarantee both safety and progress in safe region expansion, StageOpt uses the following confidence intervals \(C_{n + 1}^{i} \left( x \right)\text{ := }C_{n}^{i} \left( x \right) {\bigcap } Q_{n}^{i} \left( x \right)\), with \(C_{0}^{i} \left( x \right)\text{ := }\left[ {h_{i} ,\infty } \right)\), so that the \(C_{n + 1}^{i}\) are sequentially contained in \(C_{n}^{i}\) for all \(n = 0, \ldots , N\). Upper and lower bounds of \(C_{n}^{i}\) can be computed and are denoted as \(u_{n}^{i}\) and \(l_{n}^{i}\), respectively.

The first stage of StageOpt is safe region expansion. An increasing sequence of safe subsets \(S_{n} \subseteq X\) is computed based on the confidence intervals of the GP posterior:

$$S_{n + 1} = \mathop {\bigcap }\limits_{i} \mathop {\bigcup }\limits_{{x \in S_{n} }} \left\{ {x^{{\prime }} \in X|l_{n + 1}^{i} \left( x \right) - L_{i} d\left( {x,x^{{\prime }} } \right) \ge h_{i} } \right\}$$

At each iteration, StageOpt computes a set of expander points \(G_{n}\), whose definition is based on the function:

$$e_{n} \left( x \right)\text{ := }\left| {\mathop {\bigcap }\limits_{i} \left\{ {x^{{\prime }} \in X{ \setminus }S_{n} |u_{n}^{i} \left( x \right) - L_{i} d\left( {x,x^{{\prime }} } \right) \ge h_{i} } \right\}} \right|$$

which (optimistically) quantifies the potential enlargement of the current safe set after sampling a new decision \(x\). Then, \(G_{n}\) is defined as follows:

$$G_{n} = \left\{ {x \in S_{n} :e_{n} \left( x \right) > 0} \right\}$$

Finally, at each iteration StageOpt selects the expander with the highest predictive uncertainty, given by \(x_{n + 1} = \mathop {\text{argmax}}\limits_{{x \in G_{n} }} \left( {u_{n}^{i} \left( x \right) - l_{n}^{i} \left( x \right)} \right)\).
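The following sketch implements one expansion step on a finite grid of candidate points (our discretized reading of the formulas above, not the authors' code; in particular, breaking ties by the maximum interval width over the safety functions is our assumption, and the sketch assumes \(G_{n}\) is non-empty):

```python
import numpy as np

def stageopt_next(safe, l, u, L, h, dist):
    """safe: (n,) bool mask for S_n; l, u: (m, n) bounds of C_n^i; L: (m,) Lipschitz
    constants; h: (m,) thresholds; dist: (n, n) pairwise distances on the grid."""
    m, n = len(h), dist.shape[0]
    # S_{n+1}: x' certified safe from some x in S_n, for every safety function i
    certified = np.ones(n, dtype=bool)
    for i in range(m):
        certified &= ((l[i][safe, None] - L[i] * dist[safe, :]) >= h[i]).any(axis=0)
    safe = safe | certified
    # e_n(x): number of unsafe x' that ALL safety functions would certify from x
    safe_idx, unsafe_idx = np.flatnonzero(safe), np.flatnonzero(~safe)
    cert_all = np.ones((len(safe_idx), len(unsafe_idx)), dtype=bool)
    for i in range(m):
        cert_all &= (u[i][safe_idx, None]
                     - L[i] * dist[np.ix_(safe_idx, unsafe_idx)]) >= h[i]
    expanders = safe_idx[cert_all.sum(axis=1) > 0]  # G_n
    width = (u - l).max(axis=0)                     # predictive uncertainty per point
    return expanders[np.argmax(width[expanders])], safe
```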

The second stage of StageOpt is BO applied within the safe region identified at the end of the first stage. GP-UCB is used as acquisition function.

Sui et al. (2018) illustrate a graphical comparison of the behaviour of SafeOpt (Berkenkamp et al. 2016; Sui et al. 2015) and StageOpt starting from the same safe seed point. The figure is reported below (Fig. 5.3).

Fig. 5.3

Evolution of GPs in SafeOpt and StageOpt for a fixed safe seed; dashed lines correspond to the mean and shaded areas to ±2 standard deviations. The first and third rows depict the utility function, and the second and fourth rows depict a single safety function. The utility and safety functions were randomly sampled from a zero-mean GP with a Matern kernel and are represented with solid blue lines. The safety threshold is shown as the green line, and safe regions are shown in red. The red markers correspond to safe expansions and blue markers to maximizations and optimizations. We see that StageOpt identifies actions with higher utility than SafeOpt. (Source: Sui et al. 2018)

Initially, both algorithms select the same points, as they use the same definition of safe expansion. However, StageOpt selects noticeably better optimization points than SafeOpt due to the UCB criterion.

In real applications, it may be difficult to compute an accurate estimate of the Lipschitz constants, which may have an adverse effect on the definition of the safe region and its expansion dynamics. One might use one of the solutions for updating \(L\), outlined in Chap. 2, like DIRECT or MultL (Sergeyev and Kvasov 2017).

Within the safe optimization framework, one can use a modified version of SafeOpt that defines safe points using only the GPs (Berkenkamp et al. 2016). This modification can be directly applied to StageOpt as well. This alternative algorithm is structured as follows:

  1. a point \(\bar{x}\) is considered safe if the lower confidence bound of each safety GP lies above the respective threshold, that is \(l_{i} \left( {\bar{x}} \right) > h_{i}\)

  2. a safe point \({x^{\prime}}\) is considered an expander depending on an optimistic measure of the expansion from \({x^{\prime}}\). More practically, for each constraint a “temporary” GP is trained on the observations performed so far plus an unsafe point \(x\) which represents an “artificial” evaluation: the value of the constraint function associated to \(x\) is given by the upper confidence bound of the “current” GP (which is trained, contrary to the “temporary” one, only on the observations performed so far). If the lower bound of all the “temporary” GPs computed at \(x\) is greater than the associated safety threshold, then \(x\) is a previously unsafe point which can be considered safe with respect to the optimistic expansion from \({x^{\prime}}\). Safe points \({x^{\prime}}\) providing such an optimistic expansion are considered expanders.

We must remark that nobody, to our knowledge, has considered the safe optimization problem subject to both the “usual” constraints \(c_{i} \left( x \right)\) and the safety constraints \(g_{i} \left( x \right)\). The main difficulty in this challenge is that evaluating an unsafe point, in SafeOpt, implies the termination of the optimization process, while evaluating an infeasible point, in CGO, only requires updating the estimate of the feasible region.

5.4 Parallel Bayesian Optimization

Performing function evaluations in parallel is a conceptually appealing way to solve optimization problems in less time than would be required by a sequential approach. Some global optimization methods, like Pure Random Search, are “embarrassingly parallel”: this is not the case for BO, which is an intrinsically sequential process, so that its parallelization raises important design issues.

The main distinction among the parallel methods proposed in the literature is between synchronous and asynchronous approaches. With respect to the first class of methods, a multi-point version of EI, called q-EI, has been proposed in Ginsbourger et al. (2008), defined as:

$$q{\text{-EI}}\left( {x_{n + 1} , \ldots ,x_{n + q} } \right) = {\mathbb{E}}\left[ {\hbox{max} \left\{ {\left( {\hbox{min} \left( {\mathbf{y}} \right) - y_{n + 1} } \right)^{ + } , \ldots ,\left( {\hbox{min} \left( {\mathbf{y}} \right) - y_{n + q} } \right)^{ + } } \right\}} \right]$$

where \({\mathbf{y}} = \left\{ {y_{1} , \ldots ,y_{n} } \right\}\) and \(\left( \cdot \right)^{ + }\) returns 0 if the argument is negative and the argument itself otherwise. This requires dealing with the minimum of dependent random variables: the exact joint distribution of the \(q\) unknown function evaluations \(y_{n + 1} , \ldots ,y_{n + q}\), conditioned on the previous observations through the current GP, is given by:

$$\left( {y_{n + 1} , \ldots ,y_{n + q} |D_{1:n} } \right) \sim {\mathcal{N}}\left( {\left( {\mu (x_{n + 1} ), \ldots ,\mu (x_{n + q} )} \right),\Sigma _{q} } \right)$$

where \(\Sigma _{q}\) is the covariance matrix updated with the \(q\) points \(x_{n + 1} , \ldots ,x_{n + q}\).

This synchronous parallelization approach is aimed at measuring the joint potential of an additional set of \(q\) points to evaluate. The main idea is to consider \(q\) synchronous processors and to maximize q-EI at each iteration; then the GP model is updated with the entire set of new observations. However, as stated in Ginsbourger et al. (2008), the maximization of q-EI can become unaffordable when both the number of dimensions and the number of points \(q\) increase.
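Since the q-dimensional integral has no simple closed form for larger \(q\), a Monte Carlo estimate based on the joint Gaussian distribution above is the standard workhorse; a sketch (our code, relying on scikit-learn's joint posterior) follows:

```python
import numpy as np

def qei_monte_carlo(gp, X_batch, y_best, n_samples=10_000, rng=None):
    """Estimate q-EI by sampling the joint GP posterior at the q batch points."""
    rng = np.random.default_rng(rng)
    mu, cov = gp.predict(X_batch, return_cov=True)  # joint Gaussian at the batch
    samples = rng.multivariate_normal(mu, cov, size=n_samples)  # (n_samples, q)
    improvement = np.maximum(y_best - samples, 0.0)  # (min(y) - y_{n+j})^+ per point
    return improvement.max(axis=1).mean()            # E[ max_j (...) ]
```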

Computational approaches for parallelizing the knowledge gradient (KG) have also been proposed (Wu and Frazier 2016). The parallel KG, namely q-KG, is Bayes-optimal for minimizing the minimum of the predictor of the GP if only one decision is remaining. The q-KG algorithm reduces to the parallel EI algorithm in the noise-free setting when the final recommendation is restricted to the previously sampled points. The basic issue is that computing q-KG and its gradient is very expensive. Analogously to q-EI, the aim of q-KG is to identify a batch of \(q\) points to evaluate, basically in a synchronous setting. Another interesting approach, namely “portfolio allocation”, has already been discussed in Chap. 4.

To address the inefficiency due to synchronization, a new asynchronous approach, based on EI, is proposed in Ginsbourger et al. (2011). This approach, namely expectation of EI (EEI), focuses on the case of asynchronous parallel BO, where a new function evaluation may be performed before all the other processors have produced their results. At any time, the observations in \(D_{1:n}\) are divided into three different groups: (i) already evaluated, (ii) currently under evaluation (aka “busy”) and (iii) candidate points for forthcoming function evaluations. The main issue is how to take busy points into account within the GP model.

The issue of how to inject partial knowledge from an ongoing evaluation at a busy point into the EI criterion, without knowing the corresponding actual value of the function, is basically solved probabilistically:

$${\text{EEI}}\left( { \cdot \,;x^{\text{busy}} } \right) = \int {\text{EI}}\left( { \cdot \,;x^{\text{busy}} ,y^{\text{busy}} } \right)\,{\text{pdf}}\left( {Y\left( {x^{\text{busy}} } \right)|D} \right)\,{\text{d}}y^{\text{busy}}$$

where \({\text{pdf}}\left( {Y\left( {x^{\text{busy}} } \right)|D} \right)\) is the probability density function of the random variable \(Y\left( {x^{\text{busy}} } \right)\) conditioned on the function evaluations performed so far and stored in \(D = D_{1:n}\). It is important to highlight that \({\text{EI}}\left( { \cdot \,;x^{\text{busy}} ,y^{\text{busy}} } \right)\) depends on \(y^{\text{busy}}\) in a quite complicated non-linear way, so that no analytical expression is available for \({\text{EEI}}\), even if \(Y\left( {x^{\text{busy}} } \right)|D\) is known and simple. Therefore, \({\text{EEI}}\) is approximated by averaging over a sufficient number of \(y^{\text{busy}}\) values sampled according to \(Y\left( {x^{\text{busy}} } \right)|D\).

A straightforward way of getting statistical estimates of \({\text{EEI}}\) is based on Monte Carlo sampling. The \({\text{EEI}}\) approach relies on conditional simulations: it proved to be a sensible criterion, but practically not straightforward to optimize.
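A sketch of this Monte Carlo approximation (our illustrative code; the “fantasy” refit via scikit-learn's `clone` is one simple way to condition on a sampled \(y^{\text{busy}}\), and `expected_improvement` is the helper defined earlier in this chapter):

```python
import numpy as np
from sklearn.base import clone

def eei_monte_carlo(gp, X_data, y_data, x_busy, X_cand, f_best, n_draws=50, rng=None):
    """Average EI(.; x_busy, y_busy) over draws y_busy ~ Y(x_busy) | D."""
    rng = np.random.default_rng(rng)
    mu_b, sd_b = gp.predict(x_busy.reshape(1, -1), return_std=True)
    acq = np.zeros(len(X_cand))
    for y_busy in rng.normal(mu_b[0], sd_b[0], size=n_draws):
        gp_tmp = clone(gp).fit(np.vstack([X_data, x_busy]),   # condition on the
                               np.append(y_data, y_busy))     # fantasized outcome
        acq += expected_improvement(gp_tmp, X_cand, min(f_best, y_busy))
    return acq / n_draws
```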

Finally, in Kandasamy et al. (2018) a parallel version of Thompson sampling (TS) is proposed. A theoretical analysis proves that a direct application of the sequential TS algorithm, presented in Chap. 3, in either synchronous or asynchronous parallel settings offers a powerful result: performing \(n\) evaluations distributed among \(q\) workers is equivalent to performing \(n\) evaluations in sequence. Experimental results on simulation–optimization problems and on the hyperparameter optimization of a convolutional neural network show that asynchronous TS outperforms many existing parallel BO algorithms.
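The asynchronous scheme is remarkably simple to sketch: whenever a worker becomes idle, it draws one sample from the current GP posterior and evaluates its minimizer, with no synchronization barrier (our illustrative code, on a finite candidate set):

```python
import numpy as np

def thompson_next_point(gp, X_cand, rng=None):
    """One Thompson-sampling decision: minimize a single posterior draw."""
    rng = np.random.default_rng(rng)
    mu, cov = gp.predict(X_cand, return_cov=True)
    f_sample = rng.multivariate_normal(mu, cov)  # one function drawn from the GP
    return X_cand[np.argmin(f_sample)]           # the idle worker evaluates this x
```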

5.5 Multi-objective Bayesian Optimization

Multi-objective optimization is characterized by a set of objective functions to be minimized, \(f_{i} :X \to {\mathbb{R}}\), \(i = 1, \ldots ,m\). The solution to this kind of problem is not unique and is defined as any \(x \in X\) such that \({\nexists }\,\bar{x} \in X\) with \(f\left( {\bar{x}} \right) \prec f\left( x \right)\), where the symbol \(\prec\) denotes the dominance relation, that is:

\(\forall i\;f_{i} \left( {\bar{x}} \right) \le f_{i} \left( x \right)\) and \(\exists j\) s.t. \(f_{j} \left( {\bar{x}} \right) < f_{j} \left( x \right)\)

This allows identifying the Pareto front, defined by the values of the different objective functions on the set of the corresponding non-dominated solutions \(x \in X\).

In BO, a typical approach is to model every \(f_{i}\) through a Gaussian process \({\text{GP}}_{i}\), with \(i = 1, \ldots ,m\). In Emerich and Klinkerberg (2008), the EI acquisition function has been generalized to the multi-objective case by considering the volume of the dominated region: the improvement provided by a new evaluation is the increase in the dominated volume. This extension of EI to the multi-objective case is also known as expected hypervolume improvement (EHVI).

Given the \(n\) function evaluations performed so far, with values \(f\left( {x_{i} } \right) = \left( {f_{1} \left( {x_{i} } \right), \ldots ,f_{m} \left( {x_{i} } \right)} \right)\), \(i = 1, \ldots ,n\), we define the set \(H_{n} = \left\{ {y \in {\mathbb{B}}:\exists i \le n,f\left( {x_{i} } \right) \prec y} \right\}\), where \({\mathbb{B}} = \left\{ {y \in {\mathbb{R}}^{m} :y \le \bar{y}} \right\} \subset {\mathbb{R}}^{m}\) and the reference point \(\bar{y} \in {\mathbb{R}}^{m}\) is introduced to guarantee that \(H_{n}\) has finite volume. Thus, \(H_{n}\) is the subset of \({\mathbb{B}}\) whose points are dominated by the previous evaluations.

Finally, the improvement provided by a new evaluation can be measured as: \(I\left( {x_{n + 1} } \right) = \left| {H_{n + 1} } \right| - \left| {H_{n} } \right|\), where \(\left| \cdot \right|\) denotes the volume in \({\mathbb{R}}^{m}\).

Since \(H_{n + 1} \supseteq H_{n}\), the value \(I\left( {x_{n + 1} } \right)\) is the increase in the volume of the dominated region (Fig. 5.4).

Fig. 5.4

Example of an improvement of the dominated region. The regions dominated by \(\varvec{y}_{1}\) and \(\varvec{y}_{2}\) are represented in shaded areas, with darker shades indicating overlapping regions. The hatched area corresponds to the improvement of the dominated region resulting from the observation of \(\varvec{y}_{3}\). (Source Feliot 2017)
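The improvement \(I\left( {x_{n + 1} } \right)\) can be estimated by Monte Carlo over the box \({\mathbb{B}}\), as in this sketch (our code, for minimization; `y_bar` is the reference point bounding \({\mathbb{B}}\)):

```python
import numpy as np

def hypervolume_improvement(Y_seen, y_new, y_bar, n_samples=100_000, rng=None):
    """Estimate I(x_{n+1}) = |H_{n+1}| - |H_n| by uniform sampling in B.
    Y_seen: (n, m) observed objective vectors; y_new, y_bar: (m,) arrays."""
    rng = np.random.default_rng(rng)
    lo = np.minimum(Y_seen.min(axis=0), y_new)  # lower corner enclosing both fronts
    pts = rng.uniform(lo, y_bar, size=(n_samples, len(y_bar)))
    in_Hn = (pts[:, None, :] >= Y_seen[None, :, :]).all(-1).any(1)  # dominated by some f(x_i)
    in_Hn1 = in_Hn | (pts >= y_new).all(-1)                         # ... or by y_new
    return np.prod(y_bar - lo) * (in_Hn1.mean() - in_Hn.mean())
```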

Finally, maximizing EHVI means maximizing the expected value of the improvement \(I\left( {x_{n + 1} } \right)\), which can be analytically defined as follows:

$${\text{EHVI}}\left( x \right) = \mathop \int \limits_{{{\mathbb{B}}\backslash H_{n} }}^{{}} {\mathbb{P}}_{n} \left( {\xi \left( x \right) \prec y} \right) dy$$

where \({\mathbb{P}}_{n}\) is the probability conditioned on \(x_{1}\), \(\xi \left( {x_{1} } \right)\), …, \(x_{n}\), \(\xi \left( {x_{n} } \right)\), and \(\xi = \left( {\xi_{1} , \ldots , \xi_{m} } \right)\) is the vector of GPs modelling the corresponding objective functions \(f = \left( {f_{1} , \ldots , f_{m} } \right)\).

In Feliot et al. (2017), the EHVI has been extended to the constrained multi-objective optimization case and to the case in which the user expresses preference-order constraints (Abdolshah et al. 2019). The EHVI acquisition function is now considered the dominant approach, and some implementations are available in Matlab and in the R packages “Gpareto” and “mlrMBO” (Horn 2015).

The EHVI acquisition function has also been extended in Wada and Hinu (2019) to deal with multi-point search.

An interesting application of multi-objective optimization to BO has been proposed in Calvin and Zilinskas (2019), where the acquisition of the next point to evaluate is defined as a bi-objective optimization problem. In this paper, it is shown that two well-known acquisition functions, EI and PI, can be framed into the bi-objective approach, whose objectives are the mean \(\mu \left( x \right)\) and the standard deviation \(\sigma \left( x \right)\) of a probabilistic surrogate model.

5.6 Multi-source and Multi-fidelity Bayesian Optimization

Function evaluations in simulation-based experiments or black-box optimization are usually expensive. The basic idea in multi-fidelity optimization is that, in the initial stages of the optimization, “cheaper” evaluations of the objective function might anyway yield considerable information while reducing the computational cost. In this framework, one considers a collection of information sources, denoted by \(f\left( {x,s} \right)\), where \(s\) represents a specific level of “fidelity” of the source. Usually, the lower \(s\) the higher the fidelity, with \(f\left( {x,0} \right)\) corresponding to the original objective function (Frazier 2018). Decreasing the fidelity decreases the accuracy of the approximation of \(f\left( x \right)\), leading to a reduction in the cost of evaluation but to a poorer accuracy in the estimate of the objective function.

In multi-fidelity optimization, the aim is still \(\mathop {\hbox{min} }\limits_{x \in X} f\left( x \right)\), but \(f\left( {x,s} \right)\) is observed at a sequence of points and fidelities \(\left\{ {\left( {x_{i} ,s_{i} } \right)} \right\}_{1:n}\) with a total cost lower than a given budget, that is \(\mathop \sum \nolimits_{i = 1}^{n} c\left( {s_{i} } \right) \le B\), where \(c\left( {s_{i} } \right)\) is the cost to evaluate at fidelity \(s_{i}\) and \(B\) is the available budget.
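The budget-constrained loop can be sketched as follows (illustrative only; `choose_point_and_fidelity` is a placeholder for a multi-fidelity acquisition such as those discussed below):

```python
def multifidelity_loop(f, cost, budget, choose_point_and_fidelity):
    """Evaluate f(x, s) at chosen points/fidelities while sum_i c(s_i) <= B."""
    spent, history = 0.0, []
    while True:
        x, s = choose_point_and_fidelity(history)  # next point and fidelity level
        if spent + cost(s) > budget:               # stop before exceeding B
            break
        history.append((x, s, f(x, s)))            # observe the source at fidelity s
        spent += cost(s)
    return history
```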

Acquisition functions may present some limitations in the development of multi-fidelity BO. Indeed, while KG, ES and PES can be applied, as shown in Poloczek (2017), EI is useless because evaluating \(f\left( {x,s} \right)\) for \(s \ne 0\) never offers an improvement over the best seen (i.e. \({\text{EI}} = 0\) for \(s \ne 0\)), leading to the selection of the highest-fidelity source only (Frazier 2018).

Multi-fidelity optimization has also been considered for the optimization of machine learning algorithms (Klein et al. 2017; Kandasamy et al. 2019). The problem of hyperparameter tuning is also considered in Sen et al. (2018), which proposes to train the algorithm on a subsampled version of the whole dataset.

Multi-fidelity optimization has also been gaining a growing importance in several applications, such as analogue circuit synthesis (Zhang et al. 2019) and engineering design (Linz et al. 2017); the latter proposes an optimization algorithm that guides the search for solutions on a high-fidelity model through the approximation of a level set from a low-fidelity model. Using the probabilistic branch-and-bound algorithm to approximate a level set for the low-fidelity model, the authors are able to efficiently locate solutions inside a target quantile and therefore reduce the number of high-fidelity evaluations needed in searches.

An interesting application of multi-fidelity BO is presented in Perdikaris et al. (2016) for model inversion in hemodynamics and biomedical engineering (Costabal et al. 2019). A comprehensive multi-fidelity framework, in the context of bandit problems, has been proposed in Kandasamy et al. (2017), where the objective function and its approximations are sampled from a GP. Another interesting work is Ghoreishi and Allaire (2018), which proposes a multi-source information approach for the constrained case.