
1 Introduction

Experimental design is the workhorse of scientific design and discovery. Bayesian Optimization (BO) has emerged as a powerful methodology for experimental design tasks [1, 2] due to its sample-efficiency in optimizing expensive black-box functions. In its basic form, BO starts with a set of randomly initialized designs and then sequentially suggests the next design until the target objective is reached or the optimization budget is depleted. Theoretical analyses [3, 4] of BO methods have provided mathematical guarantees of sample efficiency in the form of sub-linear regret bounds. While BO is an efficient optimization method, it only uses data gathered during the design optimization process. However, in real world experimental design tasks, we also have access to human experts [5] who have enormous knowledge about the underlying physical phenomena. Incorporating such valuable knowledge can greatly accelerate the sample-efficiency of BO.

Previous efforts in BO literature have incorporated expert knowledge on the shape of functions [6], form of trends [7], priors over optima [8] and model selection [9], which require experts to provide very detailed knowledge about the black-box function. However, most experts understand the process in an approximate or qualitative way, and usually reason in terms of intermediate abstract properties: the expert compares designs and reasons about why one design is better than another using high-level abstractions. For instance, consider the design of a spacecraft shield (Whipple shield) consisting of two plates separated by a gap to safeguard the spacecraft against micro-meteoroid and orbital debris particle impacts. The design efficacy is measured by observing the penetration caused by hyper-velocity debris. An expert would reason about why one design is better than another and accordingly come up with a new design to try out. As part of their domain knowledge, human experts often expect the first plate to shatter the space debris and the second to absorb the effect of the fragments. Based on these abstract intuitions, the expert will compare a pair of designs by examining the shield penetration images and ask: Does the first plate shatter better (Shattering)? Does the second plate absorb the fragments better (Absorption)? The use of such abstractions allows experts to predict the overall design objective, resulting in an efficient experimental design process. It is important to note that measuring such abstractions is usually not feasible and only the expert's qualitative inputs are available. Incorporating such abstract properties in BO to accelerate the experimental design process is not well explored.

In this paper, we propose a novel human-AI collaborative approach - Bayesian Optimization with Abstract Properties (BOAP) - to accelerate BO by capturing expert inputs about the abstract, unmeasurable properties of the designs. Since expert inputs are usually qualitative [10] and often available in the form of design preferences based on abstract properties, we model each abstract property via a latent function using the qualitative pairwise rankings. We note that eliciting such pairwise preferences about designs does not add significant cognitive overhead for the expert, in contrast to asking for explicit knowledge about properties. We fit a separate rank Gaussian process [11] to model each property. Our framework allows enormous flexibility for expert collaborations as it does not need the exact value of an abstract property, just its ranking. A schematic of our proposed BOAP framework is shown in Fig. 1.

Although we anticipate that experts will provide accurate preferences on abstract properties, the expert preferential knowledge can sometimes be misleading. Therefore, to avoid such undesired bias, we use two models for the black-box function. The first model uses a “main” Gaussian Process (GP) to model the black-box function in an augmented input space where the design variables are augmented with the estimated abstract properties modeled via their respective rank GPs. The second model uses another “main” GP to model the black-box function using the original design space without any expert inputs. At each iteration, we use predictive likelihood-based model selection to choose the “best” model, i.e., the one with the higher probability of finding the optima.

Fig. 1. A schematic representation of our proposed framework, Bayesian Optimization with Abstract Properties (BOAP).

Our contributions are: (i) we propose a novel human-AI collaborative BO algorithm (BOAP) for incorporating expert pairwise preferences on abstract properties via rank GPs (Sect. 3), (ii) we provide a brief discussion on the convergence behavior of our proposed BOAP method (Sect. 4), and (iii) we provide empirical results on both synthetic optimization problems and real-world design optimization problems to demonstrate the usefulness of the BOAP framework (Sect. 5).

2 Background

Notations

We use lower case bold fonts \(\textbf{v}\) for vectors and \(v_{i}\) for each element in \(\textbf{v}\). \(\textbf{v}^{\intercal }\) is the transpose. We use upper case bold fonts (and bold greek symbols) \(\textbf{M}\) for matrices and \(M_{ij}\) for each element in \(\textbf{M}\). \(\textrm{abs}(\cdot )\) is the absolute value. \(\mid \cdot \mid \) is the determinant. \(\mathbb {N}_{n}=\{1,2,\cdots ,n\}\). \(\mathbb {R}\) denotes the reals. \(\mathcal {X}\) is an index set and \(\textbf{x}\in \mathcal {X}\).

2.1 Bayesian Optimization

Bayesian Optimization (BO) [12, 13] provides an elegant framework for finding the global optima of an expensive black-box function \(f(\textbf{x})\), given as \(\textbf{x}^{\star }\in \text {argmax}_{\textbf{x}\in \mathcal {X}}f(\textbf{x})\), where \(\mathcal {X}\) is a compact search space. BO is comprised of two main components: (i) a surrogate model (usually a Gaussian Process [11]) of the unknown objective function \(f(\textbf{x})\), and (ii) an Acquisition Function \(u(\textbf{x})\) [14] to guide the search for optima.

Gaussian Process. A Gaussian Process (GP) [11] is a flexible, non-parametric distribution over functions. It is a preferred surrogate model because of its simplicity and tractability, in contrast to other surrogate models such as Student-t process [15] and Wiener process [16]. A GP is defined by a prior mean function \(\mu (\textbf{x})\) and a kernel \(k:\mathcal {X}\times \mathcal {X}{\rightarrow }\mathbb {R}\). The function \(f(\textbf{x})\) is modeled using a zero-mean GP as \(f(\textbf{x})\sim \mathcal{G}\mathcal{P}(0,k(\textbf{x},\textbf{x}'))\). If \(\mathcal {D}_{1:t}=\{\textbf{x}_{1:t},\textbf{y}_{1:t}\}\) denotes a set of observations, where \(y=f(\textbf{x})+\eta \) is the observation corrupted with noise \(\eta \sim \mathcal {N}(0,\sigma _{\eta }^{2})\), then, according to the properties of GP, the observed samples \(\mathcal {D}_{1:t}\) and a new observation \((\textbf{x}_{\star },f(\textbf{x}_{\star }))\) are jointly Gaussian. Thus, the posterior distribution of \(f(\textbf{x}_{\star })\) is \(\mathcal {N}(\mu (\textbf{x}_{\star }),\sigma ^{2}(\textbf{x}_{\star }))\), where \(\mu (\textbf{x}_{\star })=\textbf{k}^{\intercal }[\textbf{K}+\sigma _{\eta }^{2}\textbf{I}]^{-1}\textbf{y}_{1:t}\), \(\sigma ^{2}(\textbf{x}_{\star })=k(\textbf{x}_{\star },\textbf{x}_{\star })-\textbf{k}^{\intercal }[\textbf{K}+\sigma _{\eta }^{2}\textbf{I}]^{-1}\textbf{k}\), \(\textbf{k}=[k(\textbf{x}_{\star },\textbf{x}_{1})\cdots k(\textbf{x}_{\star },\textbf{x}_{t})]^{\intercal }\), and \(\textbf{K}=[k(\textbf{x}_{i},\textbf{x}_{j})]_{i,j\in \mathbb {N}_{t}}\).
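The posterior mean and variance above can be illustrated with a minimal NumPy sketch. The squared-exponential kernel, the lengthscale, the noise level and the function names `se_kernel` and `gp_posterior` are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B (assumed kernel)."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(X, y, x_star, noise_var=1e-6, lengthscale=1.0):
    """Zero-mean GP posterior mean and variance at a single test point x_star."""
    K = se_kernel(X, X, lengthscale)                          # K = [k(x_i, x_j)]
    k = se_kernel(X, x_star[None, :], lengthscale)[:, 0]      # k = [k(x_star, x_i)]
    A = K + noise_var * np.eye(len(X))                        # K + sigma_eta^2 I
    mean = k @ np.linalg.solve(A, y)                          # k^T A^{-1} y
    prior_var = se_kernel(x_star[None, :], x_star[None, :], lengthscale)[0, 0]
    var = prior_var - k @ np.linalg.solve(A, k)               # k(x*, x*) - k^T A^{-1} k
    return mean, var

# Three hypothetical observations of a 1-D function.
X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.2, 1.0, 0.4])
mean_at_obs, var_at_obs = gp_posterior(X, y, np.array([0.5]))
mean_far, var_far = gp_posterior(X, y, np.array([10.0]))
```

At an observed location the posterior mean stays close to the observation and the variance is small, while far from the data the prediction reverts to the zero prior mean with prior variance.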

Acquisition Functions. The acquisition function selects the next point for evaluation by balancing exploitation against exploration (i.e. searching in high value regions vs highly uncertain regions). Some popular acquisition functions include Expected Improvement (EI) [17], GP-UCB [3] and Thompson Sampling (TS) [18]. A standard BO algorithm is provided in Sect. 8 of the supplementary material.

2.2 Rank GP Distributions

[19] demonstrated that humans are better at providing qualitative comparisons than absolute magnitudes. Thus modeling latent human preferences is crucial when optimization objectives are in domains such as A/B testing of web designs [20], recommender systems [21], player skill rating [22] and many more. [23] proposed a non-parametric Bayesian algorithm for learning instance or label preferences. We now discuss modeling pairwise preference relations using rank GPs.

Consider a set of n distinct training instances denoted by \(X=\{\textbf{x}_{i}\,\forall i\in \mathbb {N}_{n}\}\) based on which pairwise preference relations are observed. Let \(P=\{(\textbf{x}\succ \textbf{x}')\,|\;\textbf{x},\textbf{x}'\in X\}\) be a set of pairwise preference relations, where the notation \(\textbf{x}\succ \textbf{x}'\) expresses the preference of instance \(\textbf{x}\) over \(\textbf{x}'\). For example, the pair \(\{\textbf{x},\textbf{x}'\}\) can be two different spacecraft shield designs and \(\textbf{x}\succ \textbf{x}'\) implies that spacecraft design \(\textbf{x}\) is preferred over \(\textbf{x}'\). [23] assume that each training instance is associated with an unobservable latent function value \(\{\bar{f}(\textbf{x})\}\) measured from an underlying hidden preference function \(\bar{f}:\mathbb {R}^{d}\rightarrow \mathbb {R}\), such that \(\textbf{x}\succ \textbf{x}'\) implies \(\bar{f}(\textbf{x})>\bar{f}(\textbf{x}')\). Employing an appropriate GP prior and likelihood, user preference can be modeled via rank Gaussian process distributions.

Preference learning has been used in BO literature [24, 25]. [24] proposed Preferential BO (PBO) to model the unobserved objective function using a binary design preferential feedback. [26] modified PBO to compute posteriors via skew GPs. [27] proposed a preference learning based BO to model preferences in a multi-objective setup using multi-output GPs. [28] proposed a preference learning with Siamese Networks to capture preferences in a Multi-task learning setup. All these works incorporate preferences about an unobserved objective function. However, in this paper, we use preference learning to model expert preferences about the intermediate abstract (auxiliary) properties. Our latent model learned using such preferential data is then used as an input to model the main objective function.

3 Framework

This paper addresses the global optimization of an expensive, black-box function f, i.e., we aim to find the global optima (\(\textbf{x}^{\star }\)) of the unknown objective function f represented as:

$$\begin{aligned} \textbf{x}^{\star }\in \underset{\textbf{x}\in \mathcal {X}}{{\text {argmax}}}\,f(\textbf{x}) \end{aligned}$$
(1)

where \(f:\mathcal {X}\rightarrow \mathbb {R}\) is a noisy and expensive objective function. For example, f could be a metric signifying the strength of the spacecraft shield. The motivation of this research work is to model f by capturing the cognitive knowledge of experts in making preferential decisions based on the inherent non-measurable abstract properties of the possible designs. The objective here is the same as that of standard BO, i.e., to find the optimal design (\(\textbf{x}^{\star }\)) that maximizes the unknown function f, but in the light of expert preferential knowledge on abstract properties. The central idea is to use preferential feedback to model and utilize the underlying higher-order properties that underpin preferential decisions about designs. We propose Bayesian Optimization with Abstract Properties (BOAP) for the optimization of f in the light of expert preferential inputs. First, we discuss expert knowledge about abstract properties. Next, we discuss GP modeling of f with preferential inputs, followed by a model-selection step that is capable of overcoming unhelpful expert bias in the preferential knowledge. A complete algorithm for BOAP is presented in Algorithm 1 at the end of this section.

3.1 Expert Preferential Inputs on Abstract Properties

In numerous scenarios, domain experts reason about the output of a system in terms of higher-order properties \(\omega _{1}(\textbf{x}),\omega _{2}(\textbf{x}),\dots \) of a design \(\textbf{x}\in \mathcal {X}\). However, these abstract properties are rarely measured and are only accessible via expert preferential inputs. For instance, a material scientist designing a spacecraft shield can easily provide her pairwise preferences on properties such as shattering and shock absorption, i.e., “this design absorbs shock better than that design”, in contrast to specifying the exact measurements of shock absorption. These properties can be simple physical properties or abstract combinations of multiple physical properties which an expert uses to reason about the output of a system. We propose to incorporate such qualitative properties accessible to the expert in the surrogate modeling of the given objective function to further improve the sample-efficiency of BO.

Let \(\omega _{1:m}(\textbf{x})\) be a set of m abstract properties derived from the design \(\textbf{x}\in \mathcal {X}\). For property \(\omega _{i}\), design \(\textbf{x}\) is preferred over design \(\textbf{x}'\) if \(\omega _{i}(\textbf{x})>\omega _{i}(\textbf{x}')\). We denote the set of preferences provided on \(\omega _{i}\) as \(P^{\omega _{i}}=\{(\textbf{x}\succ \textbf{x}')\,\text {if}\,\omega _{i}(\textbf{x})>\omega _{i}(\textbf{x}')\,|\;\textbf{x},\textbf{x}'\in \mathcal {X}\}\).

Rank GPs for Abstract Properties. We capture the aforementioned expert preferential data for each of the abstract properties \(\omega _{1:m}\) individually using m separate rank Gaussian process distributions [23]. In conventional GPs the observation model consists of a map of input-output pairs. In contrast, the observation model of a rank (preferential) Gaussian process (\(\mathcal{G}\mathcal{P}\)) consists of a set of instances and a set of pairwise preferences between those instances. The central idea here is to capture the ordering over a set of n instances \(X=\{\textbf{x}_{i}\,|\,\forall i\in \mathbb {N}_{n}\}\) by learning latent preference functions \(\{\omega _{i}\,|\,\forall i\in \mathbb {N}_{m}\}\). We denote such a rank GP modeling abstract property \(\omega _{i}\) by the notation \(\mathcal{G}\mathcal{P}_{\omega _{i}}\).

Let \(\mathcal {X}\subset \mathbb {R}^{d}\) be a \(d\)-dimensional compact search space and \(X=\{\textbf{x}_{i}\,|\,\forall i\in \mathbb {N}_{n}\}\) be a set of n training instances. Let \(\boldsymbol{\omega }=\{\omega (\textbf{x})\}\) be the unobservable latent preference function values associated with each of the instances \(\textbf{x}\in X\). Let P be the set of p pairwise preferences between instances in X, defined as:

$$ P=\{(\textbf{x}\succ \textbf{x}')_{j}\,\text {if}\,\omega (\textbf{x})>\omega (\textbf{x}')\,|\;\textbf{x},\textbf{x}'\in X,\forall j\in \mathbb {N}_{p}\} $$

where \(\omega \) is the latent preference function. The observation model for the rank GP distribution \(\mathcal{G}\mathcal{P}_{\omega }\) modeling the latent preference function \(\omega \) is given as:

$$ \mathcal {\bar{D}}=\{\textbf{x}_{1:n},P=\{(\textbf{x}\succ \textbf{x}')_{j}\,\forall \textbf{x},\textbf{x}'\in X,j\in \mathbb {N}_{p}\}\} $$

We follow the probabilistic kernel approach for preference learning [23] to formulate the likelihood function and Bayesian probabilities. Imposing non-parametric GP priors on the latent function values \(\mathbf {\boldsymbol{\omega }}\), we arrive at the prior probability of \(\boldsymbol{\omega }\) given by:

$$\begin{aligned} \mathcal {P}(\mathbf {\boldsymbol{\omega }})=(2\pi )^{-\frac{n}{2}}|\textbf{K}|^{-\frac{1}{2}}\exp \big ({\scriptstyle -\frac{1}{2}}\mathbf {\boldsymbol{\omega }}^{\intercal }\textbf{K}^{-1}\mathbf {\boldsymbol{\omega }}\big ) \end{aligned}$$
(2)

With suitable noise assumptions \(\mathcal {N}(0,\tilde{\sigma }_{\eta }^{2})\) on inputs and the preference relations \((\textbf{x},\textbf{x}')_{1:p}\) in P, the Gaussian likelihood function based on [29] is:

$$\begin{aligned} \mathcal {P}(\mathbf {(\textbf{x}\succ \textbf{x}')}_{i}|\omega (\textbf{x}),\omega (\textbf{x}'))=\varPhi \big (z_{i}(\textbf{x},\textbf{x}')\big ) \end{aligned}$$
(3)

where \(\varPhi \) is the c.d.f of standard normal distribution and \(z(\textbf{x},\textbf{x}')=\frac{\omega (\textbf{x})-\omega (\textbf{x}')}{\sqrt{2\tilde{\sigma }_{\eta }^{2}}}\). Based on Bayes theorem, the posterior distribution of the latent function given the data is given by:
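As a small illustration of the likelihood in Eq. (3), the sketch below evaluates \(\varPhi (z)\) for a single preference using only the Python standard library. The function name `preference_likelihood` and the numerical values are hypothetical.

```python
import math

def preference_likelihood(omega_x, omega_xp, noise_var=1.0):
    """Probability of observing x > x' given latent values omega(x) and omega(x')."""
    z = (omega_x - omega_xp) / math.sqrt(2.0 * noise_var)
    # Standard normal c.d.f. written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_tie = preference_likelihood(0.3, 0.3)                   # equal latent values
p_win = preference_likelihood(1.0, 0.0, noise_var=0.5)    # here z = 1
p_lose = preference_likelihood(0.0, 1.0, noise_var=0.5)   # here z = -1
```

Equal latent values give a probability of one half, and swapping the two designs gives the complementary probability, so a preference that contradicts the latent ordering is possible but unlikely.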

$$ \mathcal {P}(\boldsymbol{\omega }|\mathcal {\bar{D}})=\frac{\mathcal {P}(\boldsymbol{\omega })}{\mathcal {P}(\mathcal {\bar{D}})}\mathcal {P}(\mathcal {\bar{D}}|\boldsymbol{\omega }) $$

where \(\mathcal {P}(\mathbf {\boldsymbol{\omega }})\) is the prior distribution (Eq. (2)), \(\mathcal {P}(\mathcal {\bar{D}})=\int \mathcal {P}(\mathcal {\bar{D}}|\boldsymbol{\omega })\mathcal {P}(\boldsymbol{\omega })\,d\boldsymbol{\omega }\) is the evidence of model parameters including kernel hyperparameters, and \(\mathcal {P}(\mathcal {\bar{D}}|\boldsymbol{\omega })\) is the probability of observing the pairwise preferences given the latent function values \(\boldsymbol{\omega }\), which can be computed as a product of the likelihood (Eq. (3)) i.e., \(\mathcal {P}(\mathcal {\bar{D}}|\mathbf {\boldsymbol{\omega }})=\prod _{p}\mathcal {P}(\mathbf {(\textbf{x}\succ \textbf{x}')}_{p}|\omega (\textbf{x}),\omega (\textbf{x}'))\). We find the posterior distribution using Laplace approximation and the Maximum A Posteriori estimate (MAP) \(\boldsymbol{\omega }_{\text {MAP}}\) as the mode of posterior distribution. We can find the MAP using Newton-Raphson descent given by:

$$\begin{aligned} \boldsymbol{\omega }^{\text {new}}=\boldsymbol{\omega }^{\text {old}}-\textbf{H}^{-1}\textbf{g}|_{\boldsymbol{\omega }=\boldsymbol{\omega }^{\text {old}}} \end{aligned}$$
(4)

where \(\textbf{g}=-\nabla _{\boldsymbol{\omega }}\log \,\mathcal {P}(\boldsymbol{\omega }|\mathcal {\bar{D}})=[\textbf{K}+\tilde{\sigma }_{\eta }^{2}\textbf{I}]^{-1}\boldsymbol{\omega }-\textbf{b}\) is the gradient of the negative log-posterior and \(\textbf{H}=[\textbf{K}+\tilde{\sigma }_{\eta }^{2}\textbf{I}]^{-1}+\textbf{C}\) is the corresponding Hessian, given \(b_{j}=\frac{\partial }{\partial \omega (\textbf{x}_{j})}\sum \limits _{p}\ln \varPhi (z_{p})\) and \(C_{ij}=\frac{-\partial ^2}{\partial \omega (\textbf{x}_{i})\partial \omega (\textbf{x}_{j})}\sum \limits _{p}\ln \varPhi (z_{p})\).
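The Newton-Raphson iteration of Eq. (4) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it minimizes the negative log-posterior, so the gradient used in the step is \([\textbf{K}+\tilde{\sigma }_{\eta }^{2}\textbf{I}]^{-1}\boldsymbol{\omega }-\textbf{b}\), and the entries of \(\textbf{b}\) and \(\textbf{C}\) are obtained by differentiating \(\ln \varPhi (z)\) analytically. The function name `rank_gp_map`, the kernel matrix and the preference pairs are hypothetical.

```python
import math
import numpy as np

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rank_gp_map(K, prefs, noise_var=0.1, n_iter=50, tol=1e-8):
    """MAP latent values for preferences given as (winner_index, loser_index) pairs."""
    n = K.shape[0]
    K_inv = np.linalg.inv(K + noise_var * np.eye(n))   # [K + sigma^2 I]^{-1}
    scale = math.sqrt(2.0 * noise_var)
    omega = np.zeros(n)
    for _ in range(n_iter):
        b = np.zeros(n)          # first derivatives of sum_p ln Phi(z_p)
        C = np.zeros((n, n))     # negative second derivatives of sum_p ln Phi(z_p)
        for win, lose in prefs:
            z = (omega[win] - omega[lose]) / scale
            ratio = norm_pdf(z) / norm_cdf(z)
            b[win] += ratio / scale
            b[lose] -= ratio / scale
            curv = ratio * (z + ratio) / scale ** 2
            C[win, win] += curv
            C[lose, lose] += curv
            C[win, lose] -= curv
            C[lose, win] -= curv
        grad = K_inv @ omega - b         # gradient of the negative log-posterior
        H = K_inv + C                    # Hessian of the negative log-posterior
        step = np.linalg.solve(H, grad)
        omega = omega - step             # Newton update, Eq. (4)
        if np.max(np.abs(step)) < tol:
            break
    return omega

# Three hypothetical designs on a line, with preferences x_3 > x_2 and x_2 > x_1.
x = np.array([0.0, 0.5, 1.0])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5 ** 2)
omega_map = rank_gp_map(K, prefs=[(2, 1), (1, 0)])
```

The recovered latent values respect the ordering implied by the preferences. The sketch takes full Newton steps without a line search, which is adequate for small well-conditioned problems; a damped step may be needed in general.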

Hyperparameter Optimization. Kernel hyperparameters (\(\theta \)) are crucial to optimize the generalization performance of the GP. We perform the model selection for our rank-GPs by maximizing the corresponding log-likelihood in the light of latent values \(\boldsymbol{\omega }_{\text {MAP}}\). In contrast to the evidence maximization mentioned in [23] i.e., \(\theta ^{\star }_\omega =\text {argmax}_{\theta _\omega }\,\mathcal {P}(\mathcal {\bar{D}}|\theta _\omega )\), we find the optimal kernel hyperparameters by maximizing the log-likelihood (\(\mathcal {\bar{L}}\)) of rank GPs i.e., \(\theta ^{\star }_\omega =\text {argmax}_{\theta _\omega }\,\mathcal {\bar{L}}\). The closed-form of log-likelihood of the rank GP is given as:

(5)

3.2 Augmented GP with Abstract Property Preferences

To account for property preferences in modeling f, we augment the input \(\textbf{x}\) of a conventional GP modeling f with the mean predictions obtained from m rank GPs (\(\mathcal{G}\mathcal{P}_{\omega _{1:m}}\)) as auxiliary inputs capturing the property preferences \(\omega _{1:m}\). In other words, instead of fitting the GP directly on \(\textbf{x}\), we fit it on \(\tilde{\textbf{x}}=[\textbf{x},\mu _{\omega _{1}}(\textbf{x}),\cdots ,\mu _{\omega _{m}}(\textbf{x})]\), where \(\mu _{\omega _{i}}\) is the predictive mean computed using:

$$\begin{aligned} \mu _{\omega _{i}}(\textbf{x})=\textbf{k}^{\intercal }[\textbf{K}+\sigma _{\eta }^{2}\textbf{I}]^{-1}\boldsymbol{\omega }_{\text {MAP}} \end{aligned}$$

where \(\textbf{k}=[k(\textbf{x},\textbf{x}_{1}),\cdots ,k(\textbf{x},\textbf{x}_{n})]^{\intercal },\) \(\textbf{K}=[k(\textbf{x}_{i},\textbf{x}_{j})]_{i,j\in \mathbb {N}_{n}}\) and \(\textbf{x}_{i}\in X\). To handle different scaling levels across the rank GPs, we normalize their outputs to the interval [0, 1], such that \(\mu _{\omega _{i}}(\textbf{x})\in [0,1]\).
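The construction of the augmented input \(\tilde{\textbf{x}}\) can be sketched as below. The text does not specify how the rank GP outputs are normalized, so min-max scaling over the supplied points is an assumption here, as are the function names and the example values.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumed normalization."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def augment_inputs(X, property_means):
    """Append one normalized rank-GP mean column per abstract property to X."""
    cols = [min_max_normalize(np.asarray(m, dtype=float)) for m in property_means]
    return np.column_stack([X] + cols)

# Three hypothetical 2-D designs and rank-GP mean predictions for m = 2 properties.
X = np.array([[0.1, 0.2], [0.4, 0.5], [0.9, 0.3]])
mu_omega = [np.array([-1.0, 0.0, 3.0]), np.array([2.0, 2.5, 1.0])]
X_aug = augment_inputs(X, mu_omega)   # shape (3, 2 + 2)
```

Each row of `X_aug` is one augmented design: the original d coordinates followed by m normalized property predictions.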

Although we model \(\tilde{\textbf{x}}\) using mean predictions \(\mu _{\omega _{i}}(\textbf{x})\), the uncertainty estimates are not (directly) considered in the modeling. The GP predictive variance tends to be high outside of the neighborhood of observations, indicating the uncertainty in our beliefs on the model. Therefore, a data point with high predictive variance \((\sigma _{\omega _{i}}(\textbf{x}))^{2}\) in the rank GP indicates model uncertainty. We incorporate this uncertainty in our main GP modeling \(\tilde{\textbf{x}}\) such that the effects of predicted abstract properties \(\mu _{\omega _{i}}(\textbf{x})\) are appropriately reduced when the model is uncertain, i.e. when \((\sigma _{\omega _{i}}(\textbf{x}))^{2}\) is high.

To achieve this, we formulate the feature-wise lengthscales as a function of predictive uncertainty of the augmented dimensions to control their importance in the overall GP. Note that augmented features can be detrimental when the model is uncertain. To address this potential problem, we use a spatially varying kernel [6] that treats the lengthscale as a function of the input, rather than a constant. A positive definite kernel with spatially varying lengthscale is given as:

$$\begin{aligned} k(\textbf{x},\textbf{x}')=\prod _{i=1}^d\sqrt{\frac{2l(x_{i})l(x'_{i})}{l^{2}(x_{i})+l^{2}(x'_{i})}}\,\exp \bigg (-\sum _{i=1}^d\frac{(x_{i}-x'_{i})^{2}}{l^{2}(x_{i})+l^{2}(x'_{i})}\bigg ) \end{aligned}$$
(6)

where \(l(\cdot )\) is the lengthscale function and \(\textbf{x} \in \mathbb {R}^d\). In our proposed framework, we model \(\tilde{\textbf{x}}\in \mathbb {R}^{d+m}\) and use lengthscale as a function \(l(\cdot )\) only for the newly augmented (m) dimensions and retain the lengthscales of the original (d) dimensions to standard constant values i.e. \(l(x_i)=l_i \,\, \forall i \in \mathbb {N}_d\). Therefore the overall kernel hyperparameter set is given as \(\theta =[l_{1},\cdots ,l_{d},l_{\omega _{1}}(\textbf{x}),\cdots ,l_{\omega _{m}}(\textbf{x})]\). As we need lengthscale function to reflect the model uncertainty, we set \(l_{\omega _{i}}(\textbf{x})=\alpha _i\tilde{\sigma }_{\omega _{i}}(\textbf{x})\), where \(\tilde{\sigma }_{\omega _{i}}(\textbf{x})\) is the normalized standard deviation of the rank GP predicted for the abstract property \(\omega _{i}\) and \(\alpha _i\) is a scale parameter that is tuned using the standard GP log-marginal likelihood in conjunction with other kernel parameters. The aforementioned lengthscales ensure that the data points \(\tilde{\textbf{x}}\) with high model uncertainty have higher lengthscale on the augmented dimensions and thus are treated as less important.
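A minimal sketch of the spatially varying kernel in Eq. (6) with uncertainty-dependent lengthscales on the augmented dimensions is given below. The function names, the constant lengthscales, the value of \(\alpha \) and the normalized standard deviations are illustrative assumptions; the sketch also assumes strictly positive lengthscales.

```python
import numpy as np

def augmented_lengthscales(l_const, alpha, sigma_norm):
    """Constant lengthscales for the d original dims, alpha_i * sigma_i(x) for the m augmented dims."""
    return np.concatenate([np.asarray(l_const, dtype=float),
                           np.asarray(alpha, dtype=float) * np.asarray(sigma_norm, dtype=float)])

def spatially_varying_kernel(xa, xb, la, lb):
    """Eq. (6) for one pair of points; la and lb are the per-dimension lengthscales at xa and xb."""
    xa, xb, la, lb = (np.asarray(v, dtype=float) for v in (xa, xb, la, lb))
    denom = la ** 2 + lb ** 2
    prefactor = np.prod(np.sqrt(2.0 * la * lb / denom))
    return float(prefactor * np.exp(-np.sum((xa - xb) ** 2 / denom)))

# Two augmented points (d = 2, m = 1) that differ only in the augmented feature.
x1 = np.array([0.2, 0.3, 0.1])
x2 = np.array([0.2, 0.3, 0.9])
l_confident = augmented_lengthscales([0.5, 0.5], [1.0], [0.2])   # low rank-GP uncertainty
l_uncertain = augmented_lengthscales([0.5, 0.5], [1.0], [1.0])   # high rank-GP uncertainty
k_confident = spatially_varying_kernel(x1, x2, l_confident, l_confident)
k_uncertain = spatially_varying_kernel(x1, x2, l_uncertain, l_uncertain)
```

With a larger lengthscale on the augmented dimension the two points are treated as more similar, so a disagreement in an uncertain abstract-property prediction has less influence on the main GP.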

The objective function is modeled on the concatenated inputs \(\tilde{\textbf{x}}\in \mathbb {R}^{d+m}\) using the spatially varying kernel (Eq. (6)) \(k(\tilde{\textbf{x}},\tilde{\textbf{x}}') \,\, \forall \tilde{\textbf{x}},\tilde{\textbf{x}}'\in \mathbb {R}^{d+m}\) and we denote this function with augmented inputs \(\tilde{\textbf{x}}\) as the human-inspired objective function \(h(\mathbf {\tilde{x}})\). The GP (\(\mathcal{G}\mathcal{P}_{h}\)) constructed in the light of expert preferential data is then used in BO to find the global optima of \(h(\mathbf {\tilde{x}})\), given as:

$$\begin{aligned} \textbf{x}^{\star }\in \underset{\mathbf {\textbf{x}\in \mathcal {X}}}{{\text {argmax}}}\,h(\mathbf {\tilde{x}}) \end{aligned}$$
(7)

The observation model is \(\mathcal {D}=\{(\textbf{x},y=h(\tilde{\textbf{x}})\approx f(\textbf{x}))\}\) i.e. the human-inspired objective function \(h(\tilde{\textbf{x}})\) is a simplified \(f(\textbf{x})\) with auxiliary features in the input, thus we observe the \(h(\tilde{\textbf{x}})\) via \(f(\textbf{x})\). The kernel hyperparameters associated with \(\mathcal{G}\mathcal{P}_{h}\) are denoted as \(\theta _h\) given as \(\theta _h=\{l_{1:d},\alpha _{1:m}\}\).

3.3 Overcoming Inaccurate Expert Inputs

Up to this point we have assumed that expert input is accurate and thus likely to accelerate BO. However, in some cases this feedback may be inaccurate, potentially slowing optimization. To overcome such bias and encourage exploration, we maintain two models: one augmented by expert abstract properties (we refer to this as Human Arm-\(\mathfrak {h}\)) and one un-augmented (we refer to this as Control Arm-\(\mathfrak {f}\)). We use predictive likelihood to select the arm at each iteration.

The control arm models f directly by observing the function values at suggested candidate points. Here, we fit a standard GP (\(\mathcal{G}\mathcal{P}_{f}\)) based on the data collected i.e., \(\mathcal {D}=\{(\textbf{x},y=f(\textbf{x})+\eta )\}\) where \(\eta \sim \mathcal {N}(0,\sigma _{\eta }^{2})\) is the Gaussian noise. The GP distribution (\(\mathcal{G}\mathcal{P}_{f})\) with hyperparameters \(\theta _f=\{l_{1:d}\}\) may be used to optimize f using a BO algorithm.

At each iteration t, we compare the predictive likelihoods (\(\mathcal {L}_{t}\)) of both the human augmented arm (Arm-\(\mathfrak {h}\)) and the control arm (Arm-\(\mathfrak {f}\)) to select the arm to pull for suggesting the next promising candidate for the function evaluation. Then, we use Thompson Sampling (TS) strategy [18] to draw a sample \(S_{t}\) from the GP distribution of the arm pulled and find its corresponding maxima given as:

$$\begin{aligned} \textbf{x}_{t}^{\mathfrak {h}}={\underset{\textbf{x}\in \mathcal {X}}{{\text {argmax}}}\,}(S^{{\mathfrak {h}}}(\tilde{\textbf{x}}));\quad \textbf{x}_{t}^{\mathfrak {f}}={\underset{\textbf{x}\in \mathcal {X}}{{\text {argmax}}}\,}(S^{{\mathfrak {f}}}(\textbf{x})) \end{aligned}$$
(8)

The arm with maximum predictive likelihood is chosen at each iteration and we observe f at the suggested location i.e., \((\textbf{x}_{t}^{\mathfrak {h}},f(\textbf{x}_{t}^{\mathfrak {h}}))\) or \((\textbf{x}_{t}^{\mathfrak {f}},f(\textbf{x}_{t}^{\mathfrak {f}}))\). Then rank GPs are updated to capture the preferences with respect to the new suggestion \(\textbf{x}_{t}^{\mathfrak {h}}\) or \(\textbf{x}_{t}^{\mathfrak {f}}\). This process continues until the evaluation budget T is exhausted. A complete flowchart of our framework is shown in Fig. 2. Additional details of BOAP framework are provided in the supplementary material (Sect. 9).
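One iteration of the arm selection and Thompson sampling step can be sketched on a finite candidate set as follows. This is a simplified illustration: the predictive log-likelihood values, the posterior mean and covariance, the tie-breaking rule and the function names are all hypothetical, and the continuous argmax of Eq. (8) is replaced by a search over candidate points.

```python
import numpy as np

def select_arm(loglik_h, loglik_f):
    """Pick the arm with the higher predictive likelihood (ties go to the human arm by assumption)."""
    return "h" if loglik_h >= loglik_f else "f"

def thompson_suggest(candidates, mean, cov, rng):
    """Draw one sample from the arm's GP posterior on the candidates and return its maximizer."""
    sample = rng.multivariate_normal(mean, cov)
    return candidates[np.argmax(sample)]

rng = np.random.default_rng(0)
candidates = np.array([[0.1], [0.5], [0.9]])     # hypothetical 1-D candidate designs
post_mean = np.array([0.0, 1.0, 0.5])            # hypothetical posterior mean of the pulled arm
post_cov = 1e-12 * np.eye(3)                     # near-deterministic posterior for illustration

arm = select_arm(loglik_h=-1.0, loglik_f=-2.0)
x_next = thompson_suggest(candidates, post_mean, post_cov, rng)
```

In a full loop, `post_mean` and `post_cov` would come from whichever GP the selected arm uses, the objective would then be evaluated at `x_next`, and both the main GPs and the rank GPs would be refit.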

Fig. 2. A complete process flowchart of our proposed BOAP framework.

Algorithm 1. BO with Preferences on Abstract Properties (BOAP)

4 Convergence Remarks

In this section we discuss the convergence of our BOAP algorithm in terms of regret bounds. As we are dealing with human expert feedback in our algorithm, it is difficult to make absolute statements as we are reliant on the accuracy of the feedback given and the knowledge of the expert involved, which may be limited if the objective must explore less thoroughly understood areas of the search space (so the expert learns alongside the GP model). Nevertheless, with minimal assumptions we may draw some conclusions that help us to better understand the impact of expert feedback, which is important not only to better understand the potential of BOAP to accelerate convergence but also to give insight into possible future directions.

BOAP may be understood as kernel learning in practice - the core distinction between the human and control arms is that the human arm features an evolving kernel (6). Assuming for simplicity that the human arm comes to dominate over time (as measured by likelihood), the influence of the kernel (6) on the convergence of the BO algorithm is measured through the maximum information gain \(\gamma _T(d)\), where d is the input dimension. For Thompson sampling type algorithms the cumulative regret \(R_T =\sum _t f(\textbf{x}^{\star }) - f(\textbf{x}_t)\) is typically [4, 30] bounded as \(R_T=\mathcal {O}(\sqrt{T\gamma _T(d)})\) (up to log factors), where the maximum information gain \(\gamma _T(d)\) is governed by the kernel K through the eigenvalues \(\lambda _1 \ge \lambda _2 \ge \ldots \) of the covariance matrix \(\textbf{K}_T\) evaluated on the observations \(\{ (\textbf{x}_t,y_t) : t \le T \}\) [3]:

$$\begin{aligned} \begin{array}{c} \gamma _T(d) \le \frac{\frac{1}{2}}{1-\frac{1}{e}} \mathop {\max }\limits _{(m_t : \sum _t m_t = T)} \mathop {\sum }\nolimits _{t=1}^{|\mathcal {D}|} \log \left( 1 + \sigma ^{-2} m_t \lambda _t \right) \end{array} \end{aligned}$$
(9)

Moreover, in [3] it is shown that the asymptotic behavior of \(\gamma _T(d)\) is controlled by the dimension d of the input. Our key insight for the kernel (6) is that we can drop features with lengthscales over a threshold without overly perturbing the kernel, effectively replacing d with the number of features (\(d_\textrm{eff}\)) having lengthscales below the threshold, while bounding the error this introduces.

Assuming that (a) expert observations obey a simple convergence assumption \(\max _{\textbf{x}\in \mathcal {X},t} K_{\omega _i} (\textbf{x},\textbf{x}_t) = \mathcal {O} (g(T))\), where \(g(T) \rightarrow 0\) as \(T \rightarrow \infty \), and (b) as \(T \rightarrow \infty \), only \(d_\textrm{eff} < d\) of the lengthscales (augmenting or otherwise) satisfy \(l_d, l_{\omega _i} < l_\textrm{max}\), then, for the kernel (6), \({\gamma }_T (d) \le \breve{\gamma }_T (d_\textrm{eff}) + \mathcal {O} (\frac{d_\textrm{eff}}{l_\textrm{max}^2}) + \mathcal {O} (g(T))\). In this expression \(\breve{\gamma }_T(d_\textrm{eff})\) is the maximum information gain for a \(d_\textrm{eff}\) dimensional SE kernel, i.e. [3] \(\breve{\gamma }_T (d_\textrm{eff}) = \mathcal {O} ((\log T)^{d_\textrm{eff}+1})\). Thus we would expect cumulative regret to satisfy:

$$\begin{aligned} \begin{array}{l} R_T = \mathcal {O} \left( \sqrt{T \left( \left( \log T \right) ^{d_\textrm{eff}+1}+\frac{d_\textrm{eff}}{l_\textrm{max}^2} \right) } \right) \end{array} \end{aligned}$$
(10)

That is, the regret bound for BOAP, assuming the human arm dominates as \(T \rightarrow \infty \), is the regret bound for BO with effective dimension \(d_\textrm{eff}\) plus a term that scales as the ratio of \(d_\textrm{eff}\) and the squared cut-off lengthscale \(l_\textrm{max}^2\). The more effectively the augmenting features are able to summarize the data in a useful way that renders other features superfluous (i.e. minimizes \(d_\textrm{eff}\)), the tighter the regret bound becomes. A detailed discussion on the maximum information gain and the regret bounds is provided in the supplementary material (Sect. 10).

5 Experiments

We evaluate the performance of the BOAP method using synthetic benchmark function optimization problems and real-world optimization problems arising in advanced battery manufacturing processes. We consider the following experimental settings for BOAP. We use the popular Automatic Relevance Determination (ARD) kernel [31] for the construction of both the rank GPs and the conventional (un-augmented) GPs. For rank GPs, we tune ARD kernel hyperparameters \(\theta _{d}=\{l_{d}\}\) using max-likelihood estimation (Eq. (5)). For the augmented GP modeling \(\tilde{\textbf{x}}\), we use a spatially varying kernel with a parametric lengthscale function (see the discussion in Sect. 3.2). As we normalize the bounds, we tune \(l_{d}\) (the lengthscale for the un-augmented features) in the interval [0.1, 1] and the scale parameter \(\boldsymbol{\alpha }\) (for the auxiliary features) in the interval (0, 2]. Further, we set signal variance \(\sigma _{f}^{2}=1\) as we standardize the outputs.

We compare the performance of BOAP algorithm with the following state-of-the-art baselines. (i) BO-TS: a standard Bayesian Optimization (BO) with Thompson Sampling (TS) strategy, (ii) BO-EI: BO with Expected Improvement (EI) acquisition function, and (iii) BOAP - Only Augmentation (BOAP-OA): Here we run our algorithm without the 2-arm scheme and we only use augmented input for GP modeling. This method shows the effectiveness of expert’s inputs. We evaluate the performance of our method against the baselines by plotting the simple regret (\(\mathcal {R}_t\)) given by: \(\mathcal {R}_t=f(\textbf{x}^{\star })-\mathop {\max }\limits _{\textbf{x}\in \mathcal {D}_{1:t}}f(\textbf{x})\), where \(f(\textbf{x}^\star )\) is the true optima of the objective function. We do not consider any preference based BO methods [24, 26] as baselines, because the preferences are provided directly on the objective function, as opposed to abstract properties that are not measured directly. The additional details of our experimental setup are provided in the supplementary material (Sect. 11).
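The simple regret defined above can be computed from a sequence of observed function values as in the following sketch. The function name and the example values are hypothetical, and the true optimum value \(f(\textbf{x}^{\star })\) is assumed to be known, as it is for synthetic benchmarks.

```python
import numpy as np

def simple_regret(f_opt, observed_values):
    """Simple regret after each evaluation: f(x*) minus the best value observed so far."""
    best_so_far = np.maximum.accumulate(np.asarray(observed_values, dtype=float))
    return f_opt - best_so_far

# Hypothetical run with true optimum value 1.0.
regret = simple_regret(1.0, [0.2, 0.5, 0.4, 0.9])
```

The resulting curve is non-increasing by construction, since a worse later evaluation does not change the best value found so far.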

5.1 Synthetic Experiments

We evaluate the BOAP framework on the global optimization of synthetic benchmark functions [32]. The synthetic functions used are listed in Table 1.

Emulating Preferential Expert Inputs: As discussed in Sect. 3.1, we fit a rank GP using the preferences that the expert provides on designs based on their cognitive knowledge. In all our synthetic experiments we set \(m=2\), i.e., we model two abstract properties \(\{\omega _{1},\omega _{2}\}\) for the considered synthetic function. We expect the expert to know the higher order abstract features of each design \(\textbf{x}\in \mathcal {X}\). We construct rank GPs by emulating the expert preferences based on such high level features of the given synthetic function. The possible high level features of the synthetic functions are listed in Table 1. We generate a preference list \(P^{\omega _i}\) for each high level feature by comparing the utilities of pairs of designs. We start with \(p=\left( {\begin{array}{c}t'\\ 2\end{array}}\right) \) preferences in P, which is updated in every iteration of the optimization process. We construct rank GP surrogates \(\{\mathcal{G}\mathcal{P}_{\omega _{1}},\mathcal{G}\mathcal{P}_{\omega _{2}}\}\) using \(P^{\omega _{1}}\) and \(P^{\omega _{2}}\).
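The emulation of expert preferences described above can be sketched as follows. The sketch compares every pair of designs by the utility of one high level feature and records the preferred design first. The function name `emulate_preferences` and the example feature are hypothetical. Dropping tied pairs is an assumption, since the text does not say how ties are handled.

```python
from itertools import combinations

def emulate_preferences(designs, utility):
    """Return pairs (winner, loser) over all pairs of designs, ranked by utility."""
    preferences = []
    for x_i, x_j in combinations(designs, 2):
        u_i, u_j = utility(x_i), utility(x_j)
        if u_i > u_j:
            preferences.append((x_i, x_j))
        elif u_j > u_i:
            preferences.append((x_j, x_i))
        # Assumption: tied pairs carry no preference and are skipped.
    return preferences

# Hypothetical high level feature of a 2-dimensional design.
feature = lambda x: x[0] ** 2 + x[1]
initial_designs = [(0.1, 0.2), (0.5, 0.1), (0.9, 0.3), (0.4, 0.8)]
P = emulate_preferences(initial_designs, feature)
```

With \(t'\) initial designs and no ties, the list holds \(\left( {\begin{array}{c}t'\\ 2\end{array}}\right) \) preferences, which matches the initial size of P stated above.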

For a given \(d\)-dimensional problem, we use \(t'=d+3\) initial observations and allocate a budget of \(T=10\times d+5\). We repeat all our synthetic experiments 10 times with random initialization and report the average simple regret [12] (along with its standard error) as a function of iterations. The convergence plots obtained for the optimization of synthetic functions after 10 runs are shown in Fig. 3. The convergence results show that our proposed BOAP method outperforms the standard baselines by a clear margin. Further, BOAP-OA, a BOAP variant that uses only the augmented GP (\(\mathcal{G}\mathcal{P}_{h}\)) without the 2-arm bandit strategy, also performs better than the baselines (BO-EI and BO-TS), which indicates that expert inputs are useful in improving the performance of the Bayesian optimization algorithm.
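The evaluation protocol above can be sketched as follows: the number of initial observations and the budget follow from the dimension, and the regret curves of the repeated runs are averaged per iteration with a standard error. The function names are hypothetical. Using the sample standard deviation divided by the square root of the number of runs is an assumption about how the standard error is computed.

```python
import math
import statistics

def budget(d):
    """Return (t_init, T) for a d-dimensional problem: t' = d + 3 and T = 10 * d + 5."""
    return d + 3, 10 * d + 5

def mean_and_standard_error(regret_runs):
    """regret_runs: one list of per-iteration simple regrets for each repeated run."""
    n_runs = len(regret_runs)
    means, errors = [], []
    for values in zip(*regret_runs):  # all runs at the same iteration
        means.append(statistics.mean(values))
        # Assumption: sample standard deviation over runs, divided by sqrt(n_runs).
        errors.append(statistics.stdev(values) / math.sqrt(n_runs))
    return means, errors

t_init, T = budget(2)
# Made-up regret curves from two runs, for illustration only.
means, errors = mean_and_standard_error([[4.0, 2.0], [2.0, 2.0]])
```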

Table 1. Details of the synthetic optimization benchmark functions. Analytical forms are provided in the \(2^{\text {nd}}\) column and the last column depicts the high level features used by a simulated expert.
Fig. 3. Simple regret vs iterations for the synthetic multi-dimensional benchmark functions. We plot the average regret (along with its standard error) obtained after 10 random repeated runs.

To demonstrate the robustness of our approach, we have conducted additional experiments that account for inaccurate or poorly chosen expert preferential knowledge. We consider two scenarios. First, we show the performance of our proposed approach when the higher order abstract properties are poorly selected. Second, we incorporate noise in the expert preferential feedback by flipping the expert preference between two inputs (designs) with a probability \(\delta \). We now discuss these two variations of our proposed method in detail.

Inaccurate Abstract Properties (BOAP-IA). In the first variation, we assume that the expert poorly selects the human abstraction features. Table 2 depicts the synthetic functions considered and the corresponding (poorly chosen or uninformative) human abstraction features (\(\omega _1\) and \(\omega _2\)). BOAP-IA uses such inaccurate human abstract features while augmenting the original input space.

Table 2. Selection of abstract (uninformative) features by a simulated human expert. The human abstraction (high level) features shown in the 3rd column are deliberately selected to be uninformative.

Noisy Expert Preferences (BOAP-NP). In the second variation, we account for inaccurate expert preferential knowledge by introducing errors in the expert's preferential feedback. To do this, we flip the preference ordering with a probability \(\delta \), i.e., \(P^{\omega ,\delta }=\{(\textbf{x} _{i}\succ \textbf{x} _{j})\,|\,\textbf{x} _{i},\textbf{x} _{j}\in \textbf{x} _{1:n},\,\nu _{ij}\,\omega (\textbf{x} _{i})>\nu _{ij}\,\omega (\textbf{x} _{j})\}\), where \(\nu _{ij}\) is a random variable that takes the value \(+1\) with probability \(1-\delta \) and \(-1\) with probability \(\delta \). In this set of experiments we set \(\delta =0.3\).
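The noisy preference set \(P^{\omega ,\delta }\) can be emulated with the following minimal sketch. It draws one sign \(\nu _{ij}\) per unordered pair of designs, which is an assumption about how the flips are sampled. The function name `noisy_preferences` and the scalar example designs are hypothetical.

```python
import random
from itertools import combinations

def noisy_preferences(designs, omega, delta, rng):
    """Return pairs (preferred, other); each pairwise ordering is flipped with probability delta."""
    preferences = []
    for x_i, x_j in combinations(designs, 2):
        nu = -1 if rng.random() < delta else 1  # -1 flips the ordering
        if nu * omega(x_i) > nu * omega(x_j):
            preferences.append((x_i, x_j))
        elif nu * omega(x_j) > nu * omega(x_i):
            preferences.append((x_j, x_i))
    return preferences

rng = random.Random(0)  # fixed seed so the emulation is repeatable
P_noisy = noisy_preferences([0.1, 0.4, 0.7, 0.9], lambda x: x, 0.3, rng)
```

Setting `delta=0` recovers the noise-free preference list, while `delta=1` reverses every preference, so the two extremes give a quick sanity check of the flipping logic.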

We evaluate the performance by computing the simple regret after \(10\times d\) iterations. The empirical results for BOAP with inaccurate features (BOAP-IA) and BOAP with noisy preferences (BOAP-NP) are presented in Fig. 4. Although the expert preferential knowledge is noisy and inaccurate, it is evident from the results that our proposed BOAP framework outperforms the standard baselines. We believe that the superior performance of the BOAP variants is due to the model selection based safeguard mechanism, which uses the 2-arm scheme to select the arm with the maximum predictive likelihood for suggesting the next sample.

Fig. 4. Simple regret vs iterations for robustness experiments using synthetic multi-dimensional benchmark functions. We plot the average regret (along with its standard error) obtained after 10 random repeated runs.

5.2 Real-World Experiments

We demonstrate the performance of BOAP in two real-world optimization use-cases in Lithium-ion battery manufacturing. These processes are complex and expensive, and thus offer wide scope for optimization. Further, battery scientists often have additional knowledge about abstract properties in the battery design space, which makes this domain well suited for evaluating our framework. We refer to the supplementary material (Sect. 11.2) for the detailed experimental setup.

Optimization of Electrode Calendering. In this experiment, we consider a case study on the calendering process proposed in [33]. The authors analyzed the effect of parameters such as calendering pressure (\(\varepsilon _{\text {cal}}\)), electrode porosity and electrode composition on electrode properties such as electrolyte conductivity, tortuosity (in both the solid phase (\(\tau _{\text {sol}}\)) and the liquid phase (\(\tau _{\text {liq}}\))), Current Collector (CC), Active Surface (AS), etc. We define an optimization problem over the data grid published in [33].

We use our proposed BOAP framework to optimize the electrode calendering process by maximizing the Active Surface of the electrodes while modeling two abstract properties: (i) Property 1 (\(\omega _{1}\)): Tortuosity in liquid phase \(\tau _{\text {liq}}\), and (ii) Property 2 (\(\omega _{2}\)): Output Porosity (OP). We simulate the expert pairwise preferential inputs \(\{P^{\omega _{\tau _{\text {liq}}}}, P^{\omega _{\text {OP}}}\}\) by comparing the actual measurements reported in the dataset published in [33]. We consider 4 initial observations and maximize the active surface of the electrodes for 50 iterations. We compare BOAP against the baselines by plotting the average simple regret (along with its standard error) after 10 repeated runs with random initialization. The convergence results obtained for the electrode optimization are shown in Fig. 5a.

Electrode Manufacturing Optimization. The battery formulation and the selection of process parameters are crucial for manufacturing long-life and energy-dense batteries. The authors of [34] analyzed the manufacturing of Lithium-ion graphite based electrodes and reported the process parameters used in manufacturing a battery along with the output charge capacities of the battery measured after certain charge-discharge cycles. In our experiment, we use BOAP to select the manufacturing process parameters that yield a battery with maximum endurance, i.e., a battery that retains the maximum charge after a given number of charge-discharge cycles. We consider Anode Thickness (AT) and Active Mass (AM) as abstract properties \(\{\omega _{\text {AT}},\omega _{\text {AM}}\}\) to maximize the battery endurance \(E=\frac{D_{50}}{D_{5}}\), where \(D_{50}\) and \(D_{5}\) are the discharge capacities of the cell at the \(50^{\text {th}}\) and \(5^{\text {th}}\) cycle, respectively. We consider 4 initial observations and maximize the endurance of the cell for 50 iterations. We compare the performance by plotting the average simple regret versus iterations after 10 random repeated runs. The convergence results obtained for maximizing the endurance are shown in Fig. 5b.
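For concreteness, the endurance objective \(E=\frac{D_{50}}{D_{5}}\) can be computed from a sequence of per-cycle discharge capacities as in the following minimal sketch. The function name `endurance`, the convention that the list starts at cycle 1, and the capacity values are all hypothetical and are not taken from [34].

```python
def endurance(discharge_capacities, late_cycle=50, early_cycle=5):
    """E = D_late / D_early, where discharge_capacities[0] is the capacity at cycle 1."""
    d_late = discharge_capacities[late_cycle - 1]
    d_early = discharge_capacities[early_cycle - 1]
    return d_late / d_early

# Made-up capacity fade: 100 units at cycle 1, losing 0.5 units per cycle.
capacities = [100.0 - 0.5 * k for k in range(50)]
E = endurance(capacities)
```

A value of E close to 1 means the cell retains most of its early-cycle capacity at the 50th cycle.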

Fig. 5. Simple regret vs iterations for battery manufacturing optimization experiments: (a) Optimization of the electrode calendering process (b) Optimization of the battery endurance.

It is evident from Fig. 5 that BOAP outperforms the baselines, which we attribute to its ability to model the abstract properties of the battery designs and use them to accelerate BO. Similar to the trends observed in the synthetic experiments, BOAP-OA, which uses only the augmented inputs, also outperforms the standard baselines (BO-EI and BO-TS), again indicating the benefit of expert inputs for the optimization performance. The supplementary material, along with the necessary implementation details and code snippets, is available at https://github.com/mailtoarunkumarav/BOAP.

6 Conclusion

We present a novel approach for human-AI collaborative BO that models expert inputs on abstract properties to further improve the sample-efficiency of BO. Experts provide preferential inputs about abstract and unmeasurable properties. We model such preferential inputs using rank GPs. We augment the inputs of a standard GP with the outputs of such auxiliary rank GPs to learn the underlying preferences in the instance space. We use a 2-arm strategy as a key safeguard, so that expert preferential inputs are used in the modeling only when they are relevant and accurate, thus guarding against misleading expert bias. We discuss the convergence of our proposed BOAP framework. The experimental results on synthetic benchmark functions and two battery manufacturing tasks show that BOAP outperforms the standard BO baselines.