1 Introduction

Modern engineering optimization problems usually rely on time-consuming numerical simulations to evaluate the objectives or constraints. For example, a post-buckling analysis of a complex aerospace shell usually needs more than two hours (Hao et al. 2012; Hao et al. 2017). Therefore, it is desirable to limit the number of function evaluations. This requirement has popularized surrogate-based optimization methods (Keshtegar et al. 2017; Hao et al. 2015), which use cheap mathematical approximations to replace the expensive true objectives in the design process. In this area, Gaussian process models, such as the Kriging surrogate (Krige 1951) and the statistical radial basis functions (Sóbester et al. 2005), have received great attention, since they not only approximate the unknown functions but also provide a probability estimate. With this merit, the efficient global optimization (EGO) methods adopt statistical infill criteria, such as the expected improvement (EI) (Jones et al. 1998; Jones 2001) and the generalized EI (GEI) (Schonlau 1998; Sasena et al. 2002), to select the next candidate point, and these criteria are able to avoid missing the global optimum to a large extent. Therefore, EGO methods have been widely used in modern engineering.

Real-world design problems usually involve more than one sub-objective, and these sub-objectives conflict with one another; this kind of problem is usually referred to as a multi-objective optimization (MOO) problem. When knowledge about the design requirements is lacking, it is desirable to present the designer with multiple diverse and competitive Pareto designs (also referred to as Pareto solutions, the Pareto frontier, or Pareto points) from which to choose the most appropriate one. The Pareto designs are defined as the points for which no other point is better in all sub-objectives. The optimization aimed at obtaining the Pareto designs is called Pareto optimization here. Historically, there are two main kinds of Pareto optimization approaches.

The first kind of Pareto optimization approach, referred to as the single-objective-based approach in this paper, transforms the MOO problem into different single-objective problems, each of which leads to a Pareto point. For example, the weighted sum method (Messac et al. 2000) and the physical programming method (Messac et al. 2004) aggregate the multiple sub-objectives into one single objective function in a linear or non-linear way. The normal boundary intersection method (Das and Dennis 2006) and the normal constraint method (Messac 2003) utilize a group of evenly distributed points on the utopia plane (formed by the optimal solutions of all sub-objectives) to produce different equality or inequality constraints in the single-objective problems.

The other kind of Pareto optimization approach relies on heuristic intelligence algorithms, such as the genetic algorithm (GA) and particle swarm optimization (PSO). The heuristic algorithms have received great attention in this area, since they are naturally suitable for obtaining Pareto solutions. Some successful algorithms, such as the non-dominated sorting GA (Deb et al. 2002) and the strength Pareto evolutionary algorithm (Kim et al. 2004), have been widely used. However, this kind of approach typically requires a massive number of objective evaluations. Therefore, some researchers have used surrogate models to substitute part of the real simulations to save computational cost. For example, Goel et al. (2007) performed a multi-objective GA on the response surfaces of all the sub-objectives to obtain an approximated Pareto frontier. Li et al. (2006) adopted a Kriging model to replace part of the true objectives in the GA, and the Kriging prediction error was employed to decide whether the prediction or the actual simulation should be performed to evaluate the design points. Jeong and Obayashi (2008) calculated the EI functions as the fitness values in the GA to find non-dominated solutions, part of which were evaluated to update the Kriging model. Jie et al. (2016) developed a Kriging-assisted PSO approach, in which the Kriging prediction took the place of the true objective and a normalized summation of EI values was used to determine the most promising designs to be evaluated by actual simulations.

In the last two decades, the EGO method has also been extended to Pareto optimization, and the research has mainly focused on statistical infill criteria for obtaining Pareto solutions. The ParEGO method was developed by Knowles (2006); its basic idea is to use augmented Tchebycheff functions with different random weight vectors to aggregate the multiple sub-objectives into different single objectives. For each single objective, a Kriging surrogate is established, and the EI function is used to obtain the new design point. A similar strategy has been adopted in a heuristic algorithm named MOEA/D by Liu et al. (2007) and Zhang et al. (2010). They established a Kriging surrogate for each sub-objective and further derived the EI function of the aggregated single objective based on the multivariate distribution. The aggregated single objective can be a weighted-sum function (Liu et al. 2007) or the augmented Tchebycheff function (Zhang et al. 2010). The Euclidean distance based EI (EI-ELU) was developed by Keane (2006), in which the expected centroid of the probability distribution (provided by Kriging) in the non-dominated region is calculated; then, the minimum distance between this centroid and the obtained Pareto points is maximized. Zhan et al. (2017) defined the EI values of each sub-objective with respect to all the obtained Pareto points as the EI matrix. Then, based on this matrix, three infill criteria based on the summation distance, the maximum distance and the hypervolume were developed, respectively.

One kind of statistical infill criterion is developed based on the hypervolume indicator, which was first proposed by Zitzler and Thiele (1998) and has been proved to be an outstanding metric for the quality of a Pareto points distribution (Knowles et al. 2003; Emmerich et al. 2005). This indicator is defined as the volume of the region dominated by the Pareto points and bounded by a reference point. With this definition, the hypervolume-based EI (HVEI) statistical infill criterion (Emmerich et al. 2011) selects the point with the maximum expected value of the hypervolume growth as the new point. Wagner et al. (2010) compared several Pareto optimization infill criteria (proposed before 2010) against a list of necessary and desired properties and found that the HVEI has the best performance in terms of most properties but a high computational cost. The computational complexity of HVEI mainly comes from the irregular geometry of the integration (non-dominated) region, and in earlier times the integral was approximated with the Monte Carlo method. Emmerich et al. (2011) described a formula for calculating HVEI in which the integration region is divided into a number of cells and the integration is performed in each of them. However, the number of cells increases exponentially with the number of objectives. Couckuyt et al. (2014) developed a fast algorithm for HVEI which requires a much smaller number of integration cells, but no computational complexity was provided. As a rough approximation of HVEI, the lower confidence bound (LCB) (Emmerich et al. 2006) is defined as the hypervolume improvement of a lower-confidence point. Since no integral is needed, the LCB has a much lower computational cost than HVEI. However, even the runtime of simply calculating the hypervolume grows exponentially with the number of objectives (Beumea and Emmerich 2007). Zhan et al. (2017) estimated that the infill criterion function could be called two million times in the whole optimization process. If one evaluation of the infill criterion costs one second, the total evaluation time will be about twenty-three days, which might be even longer than the true simulations. Therefore, although the HVEI shows great potential for generating well-distributed Pareto points, its high computational complexity limits its application to many-objective problems.

In this paper, a modified HVEI (MHVEI) function is developed, which can be considered as a weighted integral form of HVEI. Compared with the original HVEI, this new statistical criterion is easier to implement, maintains all the desired properties, and has a much lower computational cost. The remainder of this paper is organized as follows. In Section 2, basic knowledge about the EGO method and HVEI is introduced. In Section 3, the MHVEI is developed and studied theoretically by comparing it with HVEI against a list of desired properties. In Section 4, the performance of MHVEI is investigated with several mathematical test functions.

2 Related works

In this section, the classic EGO method for the single objective problem and the original HVEI function will be introduced.

2.1 Efficient global optimization

2.1.1 Ordinary kriging (OK) surrogate

The OK surrogate is a widely used Gaussian process model. Consider an objective function y(x) to be minimized, where \( \mathbf{x}\mathbf{\in}\mathbb{X} \) denotes the design variables and \( \mathbb{X}\subset {\mathbb{R}}^d \) denotes the design variable space. The OK surrogate assumes that the unknown y(x) follows a normal distribution for any \( \mathbf{x}\mathbf{\in}\mathbb{X} \)

$$ y\left(\mathbf{x}\right)\sim \mathcal{N}\left(\beta, {\sigma}^2\right) $$
(1)

where β and σ² are the assumed mean and variance. For any two points \( {\mathbf{x}}_i\mathbf{\in}\mathbb{X} \) and \( {\mathbf{x}}_j\mathbf{\in}\mathbb{X} \),

$$ \mathit{\operatorname{cov}}\left[y\left({\mathbf{x}}_i\right),y\left({\mathbf{x}}_j\right)\right]={\sigma}^2r\left({\mathbf{x}}_i,{\mathbf{x}}_j,\boldsymbol{\uptheta} \right)={\sigma}^2\mathit{\exp}\left[{\sum}_{k=1}^d-{\theta}_k{\left({x}_{i,k}-{x}_{j,k}\right)}^2\right] $$
(2)

where r denotes a given correlation function, and θ = {θk | 1 ≤ k ≤ d} denotes the correlation parameters in the different dimensions. In (2), the Gaussian correlation model is adopted.

The OK surrogate is established based on a group of evaluated design variables \( \mathbf{S}=\left\{{\mathbf{S}}_i|{\mathbf{S}}_i\mathbf{\in}\mathbb{X},1\le i\le N\right\} \) with their objective values Y = y(S). In this paper, S and Y will be referred to as the training data of the OK surrogate, and N denotes the number of training points. If (1) and (2) are considered as prior hypotheses and Y = y(S) is taken as the known event, the conditional probability distribution of y(x) for any \( \mathbf{x}\mathbf{\in}\mathbb{X} \) can be derived from Bayes' theorem as

$$ \left[\left.y\left(\mathbf{x}\right)\right|y\left(\mathbf{S}\right)=\mathbf{Y}\right]\sim \mathcal{N}\left[\widehat{y}\left(\mathbf{x}\right),\widehat{\sigma}{\left(\mathbf{x}\right)}^2\right] $$
(3)

where \( \widehat{y}\left(\mathbf{x}\right) \) and \( \widehat{\sigma}{\left(\mathbf{x}\right)}^2 \) are the predicted mean and variance functions of the OK surrogate, respectively, given by

$$ \widehat{y}\left(\mathbf{x}\right)=\frac{{\mathbf{1}}^T{\mathbf{R}}^{-1}\mathbf{Y}}{{\mathbf{1}}^T{\mathbf{R}}^{-1}\mathbf{1}}+\mathbf{r}{\left(\mathbf{x}\right)}^T{\mathbf{R}}^{-1}\left(\mathbf{Y}-\frac{{\mathbf{1}}^T{\mathbf{R}}^{-1}\mathbf{Y}}{{\mathbf{1}}^T{\mathbf{R}}^{-1}\mathbf{1}}\mathbf{1}\right) $$
(4)
$$ \widehat{\sigma}{\left(\mathbf{x}\right)}^2={\sigma}^2\left\{1+\frac{{\left[1-{\mathbf{1}}^T{\mathbf{R}}^{-1}\mathbf{r}\left(\mathbf{x}\right)\right]}^2}{{\mathbf{1}}^T{\mathbf{R}}^{-1}\mathbf{1}}-\mathbf{r}{\left(\mathbf{x}\right)}^T{\mathbf{R}}^{-1}\mathbf{r}\left(\mathbf{x}\right)\right\} $$
(5)

where r(x) = {r(x, S1, θ), r(x, S2, θ), ⋯, r(x, SN, θ)}T denotes the correlation vector between x and S, 1 denotes the vector of N ones, and R = [r(Si, Sj, θ)]i, j denotes the correlation matrix of S. Based on (3), the probability density function (PDF) provided by the OK surrogate can be written as

$$ p\left[y\left(\mathbf{x}\right)\right]=\phi \left[\frac{y\left(\mathbf{x}\right)-\widehat{y}\left(\mathbf{x}\right)}{\widehat{\sigma}\left(\mathbf{x}\right)}\right] $$
(6)

where ϕ denotes the probability density function of the standard normal distribution.
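To make (2)–(5) concrete, a minimal MATLAB sketch of the OK predictor is given below. It is not the implementation used in this paper (the DACE toolbox is used in Section 4); the function name and interface are ours, and the parameters θ and σ² are assumed to have been estimated already, e.g. by maximizing (7).

```matlab
% Minimal sketch of the OK predictor in (4)-(5); S is the N-by-d matrix of training
% sites, Y the N-by-1 vector of observed objectives, theta the 1-by-d correlation
% parameters and sigma2 the process variance (both assumed already estimated).
function [yhat, s2] = ok_predict(x, S, Y, theta, sigma2)
    N = size(S, 1);
    R = zeros(N, N);                        % correlation matrix of the training sites, (2)
    for i = 1:N
        for j = 1:N
            R(i, j) = exp(-sum(theta .* (S(i, :) - S(j, :)).^2));   % Gaussian correlation model
        end
    end
    r = zeros(N, 1);                        % correlation vector between x and S
    for i = 1:N
        r(i) = exp(-sum(theta .* (x - S(i, :)).^2));
    end
    one  = ones(N, 1);
    Rinv = R \ eye(N);                      % in practice a Cholesky factorization is preferred
    beta = (one' * Rinv * Y) / (one' * Rinv * one);     % generalized least-squares mean
    yhat = beta + r' * Rinv * (Y - beta * one);         % predicted mean, (4)
    s2   = sigma2 * (1 + (1 - one' * Rinv * r)^2 / (one' * Rinv * one) ...
                       - r' * Rinv * r);                % predicted variance, (5)
end
```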

The parameters of the OK model, including β, σ² and θ, are usually estimated by maximizing the likelihood function

$$ {\max}_{\beta, {\sigma}^2,\boldsymbol{\uptheta}}\ p\left[y\left(\mathbf{S}\right)=\mathbf{Y}\right]=\frac{1}{\sqrt{{\left(2\pi {\sigma}^2\right)}^N\left|\mathbf{R}\right|}}\mathit{\exp}\left[\frac{-1}{2{\sigma}^2}{\left(\mathbf{Y}-\beta \mathbf{1}\right)}^T{\mathbf{R}}^{-1}\left(\mathbf{Y}-\beta \mathbf{1}\right)\right] $$
(7)

More details about the derivation of the OK surrogate can be found in the literature (Ginsbourger et al. 2010) and will not be repeated here.

2.1.2 Generalized EI function

During the EGO process, the best obtained objective value is selected from the training data as Ymin = min{Y}. Then, for any unevaluated design x, the improvement over Ymin is defined as

$$ I\left(\mathbf{x}\right)=\max \left\{0, Ymin-y\left(\mathbf{x}\right)\right\} $$
(8)

The probability distribution of I(x) can be obtained from (6). Then, the EI method takes the design with the maximum expected value of I(x) as the infill point

$$ EI\left[y\left(\mathbf{x}\right), Ymin\right]=E\left[I\left(\mathbf{x}\right)\right]={\int}_0^{\infty } Ip(I) dI=u\left(\mathbf{x}\right)\Phi \left[\frac{u\left(\mathbf{x}\right)}{\widehat{\sigma}\left(\mathbf{x}\right)}\right]+\widehat{\sigma}\left(\mathbf{x}\right)\phi \left[\frac{u\left(\mathbf{x}\right)}{\widehat{\sigma}\left(\mathbf{x}\right)}\right] $$
(9)

where \( u\left(\mathbf{x}\right)= Ymin-\widehat{y}\left(\mathbf{x}\right) \) and Φ denotes the cumulative distribution function of the standard normal distribution.
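A minimal MATLAB sketch of (9) is given below; the function name is ours, and normpdf/normcdf are avoided so that only base MATLAB is needed. This helper is reused in the sketches of Sections 3.2 and 3.3.

```matlab
% Minimal sketch of the EI function in (9): E[max(0, ymin - y)] with y ~ N(yhat, s^2).
function ei = expected_improvement(yhat, s, ymin)
    if s <= 0
        ei = max(0, ymin - yhat);                 % no uncertainty: the improvement is deterministic
        return
    end
    phi = @(t) exp(-0.5 * t.^2) / sqrt(2 * pi);   % standard normal density
    Phi = @(t) 0.5 * (1 + erf(t / sqrt(2)));      % standard normal CDF
    u   = ymin - yhat;
    ei  = u * Phi(u / s) + s * phi(u / s);
end
```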

Note that \( {\widehat{\sigma}}^2 \) is the expected squared difference between y and \( \widehat{y} \), that is,

$$ {\widehat{\sigma}}^2=E\left[{\left(y-\widehat{y}\right)}^2\right] $$
(10)

therefore \( \widehat{\sigma}\left(\mathbf{x}\right) \) is also referred to as the uncertainty, and it is used to estimate the accuracy of \( \widehat{y}\left(\mathbf{x}\right) \). The EI function prefers to select points with a higher \( \widehat{\sigma} \) or a lower \( \widehat{y} \). The former preference is usually referred to as global exploration, which is meant to improve the accuracy of the surrogate, and the latter is referred to as local exploitation, which is meant to search for the optimum of the surrogate. These two search preferences can avoid missing the global optimum to a large extent, and they can be further balanced in the GEI function (Schonlau 1998; Sasena et al. 2002) as

$$ E\left[{I}^g\left(\mathbf{x}\right)\right]={\int}_0^{\infty }{I}^gp(I) dI={\int}_0^{\infty }{I}^{g-1}\, Ip(I)\, dI $$
(11)

where the positive integer g is used to control the balance. If Ig − 1 is considered as the integral weight coefficient of the improvement, one can observe that there is a trade-off between a small improvement with a large probability (local exploitation) and a large improvement with a small probability (global exploration) (Schonlau 1998). As g grows, the balance inclines towards global exploration, which is more suitable when the unknown objective is poorly approximated (Sasena et al. 2002); otherwise, it inclines towards local exploitation.
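The closed-form expression of (11) given by Schonlau (1998) is not reproduced here; as an illustration only, a minimal Monte Carlo sketch of (11) is shown below (function name and sample count are ours), which can also be used to verify a closed-form implementation.

```matlab
% Minimal Monte Carlo sketch of the GEI value in (11) for a single design x,
% given the Kriging prediction yhat, uncertainty s, current best ymin and order g.
function gei = gei_mc(yhat, s, ymin, g, nSamples)
    if nargin < 5, nSamples = 1e6; end
    y   = yhat + s * randn(nSamples, 1);   % draws from the predictive distribution (3)
    I   = max(0, ymin - y);                % improvement samples, (8)
    gei = mean(I.^g);                      % sample mean of I^g approximates (11)
end
```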

2.2 Hypervolume based expected improvement

Consider M objectives y(x) = {y1(x), y2(x), ⋯, yM(x)} to be minimized, where all the sub-objectives can be evaluated by a single simulation. In most cases, the dependency between the different sub-objective functions is not available; therefore, the standard assumption is usually made that all the sub-objectives are independent of one another (Emmerich et al. 2006). One can use a Kriging surrogate to estimate the probability distribution of each sub-objective separately, and the PDF of y(x) is then expressed as

$$ p\left[\mathbf{y}\left(\mathbf{x}\right)\right]={\prod}_{i=1}^M\phi \left[\frac{y_i\left(\mathbf{x}\right)-{\widehat{y}}_i\left(\mathbf{x}\right)}{{\widehat{\sigma}}_i\left(\mathbf{x}\right)}\right] $$
(12)

In each iteration of the EGO process for MOO, the current Pareto points P = {Pi | 1 ≤ i ≤ n} are selected from the training data Y. The hypervolume indicator is an outstanding metric for the quality of the Pareto points distribution; it is defined as the volume of the region dominated by P and bounded by a reference point \( \mathbf{r}\in {\mathbb{R}}^M \). It can be mathematically expressed as

$$ H\left(\mathbf{P}\right)={\int}_{\mathbf{a}\prec \mathbf{r}\wedge \exists {\mathbf{P}}_i\mathbf{\in}\mathbf{P}:{\mathbf{P}}_i\prec \mathbf{a}}d\mathbf{a} $$
(13)

The reference point r should be set to a relatively large value to ensure that it is dominated by every true Pareto point. A higher value of the hypervolume indicator means a better distribution of the Pareto points; therefore, the Pareto optimization problem can be transformed into maximizing H(P). Then, for a given point y ∈ Ω, the hypervolume improvement obtained when y is added to the training data is defined as the growth of H(P), that is,

$$ {I}_{hv}\left(\mathbf{y}\right)=H\left(\mathbf{P}\cup \mathbf{y}\right)-H\left(\mathbf{P}\right) $$
(14)

Note that Ihv(y) ≥ 0, and Ihv(y) > 0 if and only if y is not strongly dominated by any Pareto point in P. As shown in Fig. 1, Ihv(y) is the area of the shaded part. Similar to the classic EI, the HVEI method takes the point with the maximum expected value of Ihv(y) as the new point, that is,

$$ {EI}_{hv}\left(\mathbf{x}\right)=E\left[{I}_{hv}\left(\mathbf{y}\left(\mathbf{x}\right)\right)\right]={\int}_{\mathbf{y}\in \boldsymbol{\Omega}}{I}_{hv}\left(\mathbf{y}\right)p\left(\mathbf{y}\right)d\mathbf{y} $$
(15)

where Ω denotes the non-dominated region

$$ \boldsymbol{\Omega} =\left\{\mathbf{a}\in {\mathbb{R}}^M|\mathbf{a}\prec \mathbf{r}\wedge \nexists {\mathbf{P}}_i\in \mathbf{P}:{\mathbf{P}}_i\prec \mathbf{a}\right\} $$
(16)
Fig. 1 Reference point, Pareto points and hypervolume improvement
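As mentioned in Section 1, the integral in (15) was approximated with the Monte Carlo method in earlier works. For illustration only, a minimal Monte Carlo sketch of the hypervolume improvement (14) is given below; the function name and sample count are ours, implicit array expansion (MATLAB R2016b or later) is used, and y is assumed to dominate the reference point r. This helper is reused in the LCB sketch of Section 4.2.

```matlab
% Minimal Monte Carlo sketch of the hypervolume improvement I_hv(y) in (14).
% y and r are 1-by-M row vectors, P is the n-by-M matrix of current Pareto points.
function ihv = mc_hv_improvement(y, P, r, nSamples)
    if nargin < 4, nSamples = 1e5; end
    M = numel(y);
    width = r - y;                              % only the box [y, r] can be newly dominated by y
    A = y + rand(nSamples, M) .* width;         % uniform samples a with y << a << r
    dominated = false(nSamples, 1);             % samples already covered by the current Pareto set
    for i = 1:size(P, 1)
        dominated = dominated | all(A >= P(i, :), 2);
    end
    ihv = prod(width) * mean(~dominated);       % box volume times the newly dominated fraction
end
```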

The HVEI is monotonically increasing with increasing \( {\widehat{\sigma}}_i \) (i = 1, 2, ⋯, M); therefore, it preserves the global exploration of the EI method to improve the accuracy of the surrogate. When the accuracy of the surrogate is high (which means \( {\widehat{\sigma}}_i \) is low), EIhv(x) regresses to Ihv(y), which is directly defined to maximize the hypervolume. Therefore, HVEI is a very effective statistical infill criterion for producing well-distributed Pareto points.

Unfortunately, because of the irregular shape of Ω, there is no direct calculation formula for either Ihv(y) or EIhv(x) when M > 2 (a formula of HVEI for the bi-objective problem is developed in Section 3.2), and the high computational complexity of HVEI (which will be further discussed in Section 3.4) has limited its application to many-objective problems.

3 Modified hypervolume based expected improvement

In this section, a new statistical criterion for Pareto optimization will be developed. For clarity, the related concepts are given first in Section 3.1. In Section 3.2, a modified version of the hypervolume improvement is proposed and compared with the original one; the MHVEI is then derived and the whole optimization process is introduced in Section 3.3. In Section 3.4, the properties of MHVEI are discussed theoretically.

3.1 Related concepts

  • Local upper bounds (LUBs) U: \( \mathbf{U}\subset {\mathbb{R}}^M \) denotes a set of points which do not dominate one another, i.e., ∀ Ui ∈ U, ∀ Uj ∈ U with i ≠ j, Ui ⊀ Uj. It can be used to define the non-dominated region as

$$ \boldsymbol{\Omega} =\left\{\mathbf{a}\in {\mathbb{R}}^M|\exists {\mathbf{U}}_i\in \mathbf{U}:\mathbf{a}\prec \prec {\mathbf{U}}_i\right\} $$
(17)
  • Grid cells: Using the jth component of each Pi to divide the jth dimension of \( {\mathbb{R}}^M \) into n + 1 intervals, the intervals on the different dimensions form \( {\left(n+1\right)}^M \) hyper-boxes. All the hyper-boxes whose lower bounds lie inside Ω are defined as the grid cells. In this paper, li = (li, 1, li, 2, ⋯, li, M) and ui = (ui, 1, ui, 2, ⋯, ui, M) are used to denote the lower and upper bounds of the ith grid cell, where li, j and ui, j denote the bounds corresponding to the jth sub-objective function.

More details about the LUBs can be found in the work of Klamroth et al. (2015) and will not be repeated here. The grid cells have been adopted by Emmerich et al. (2011) to calculate HVEI. Figure 2 shows the relationship between P and U, with the grid cells denoted by the dotted lines; a simple bi-objective construction of the LUBs is sketched after Fig. 2. Note that every LUB is the upper bound of a grid cell.

Fig. 2 LUBs, grid cells and modified hypervolume improvement
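The sketch below constructs the n + 1 LUBs for the bi-objective case only (consistent with m = n + 1 noted in Section 3.4); it is our simplified illustration, not the general algorithm of Przybylski et al. (2010) whose Matlab code is given in Appendix 1.

```matlab
% Minimal sketch: local upper bounds of a bi-objective Pareto set P (n-by-2,
% mutually non-dominated) bounded by the reference point r (1-by-2).
function U = lubs_biobjective(P, r)
    P = sortrows(P, 1);                    % ascending in f1, hence descending in f2
    n = size(P, 1);
    U = zeros(n + 1, 2);
    U(1, :) = [P(1, 1), r(2)];             % cell left of the best-f1 Pareto point
    for i = 2:n
        U(i, :) = [P(i, 1), P(i - 1, 2)];  % each interior LUB pairs two neighbouring Pareto points
    end
    U(n + 1, :) = [r(1), P(n, 2)];         % cell below the best-f2 Pareto point
end
```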

3.2 Modified hypervolume improvement function

In the EGO method, the improvement function should be defined based on the design purpose. For example, maximizing I(x) is equivalent to minimizing the single objective function, and maximizing Ihv is equivalent to maximizing the hypervolume. In other words, the improvement function for the MOO problem should be able to produce well-distributed Pareto points when it is used as the single objective function aggregating the multiple sub-objectives (like the weighted sum and physical programming methods, see Section 1). Three desired properties of the improvement function for Pareto optimization are listed as follows:

  1. Property 1.

    The improvement function should be monotonically decreasing with increasing yj (j = 1, 2, ⋯, M).

  2. Property 2.

    When P ⊂ Ptrue, the improvement function should prefer the point which most augments the current Pareto frontier.

  3. Property 3.

    A simple and integrable formula of the improvement function is desired.

Among these properties, Property 1 ensures that the maximum point of the improvement function always lies on Ptrue, Property 2 relates to the quality of the Pareto points distribution, and Property 3 is meant to ensure that a direct calculation formula of the statistical infill criterion exists.

Because maxIhv(y) ⟺ max H(P ∪ y), Ihv satisfies the first two properties perfectly but fails on the last one. By adopting the concept of the LUBs, Ihv(y) can be written in the form

$$ {I}_{hv}\left(\mathbf{y}\right)={\sum}_{i=1}^mV\left(\left[\mathbf{y},{\mathbf{U}}_i\right]\right)-{V}_{overlap}\left(\mathbf{y}\right) $$
(18)

where the first term accounts for the volume of the whole region dominated by y, and Voverlap denotes the volume of the overlapping parts in the first term. Note that, for a given Ui ∈ U, if y does not strongly dominate Ui, then [y, Ui] = ∅ and V([y, Ui]) = 0. As shown in Fig. 2, the areas with different shading denote the regions of the different [y, Ui] in the first term, and the overlapping shaded area relates to the value of Voverlap.

For the bi-objective problem, every Pareto point dominates exactly two LUBs, and Voverlap(y) can then be calculated as

$$ {V}_{overlap}\left(\mathbf{y}\right)={\sum}_{i=1}^nV\left(\left[\mathbf{y},{\mathbf{P}}_i\right]\right) $$
(19)

Substituting (18) and (19) into (15), the formula of HVEI can be written as

$$ {EI}_{hv}\left(\mathbf{x}\right)={\sum}_{i=1}^m\left[{\prod}_{j=1}^M EI\left({y}_j,{U}_{i,j}\right)\right]-{\sum}_{i=1}^n\left[{\prod}_{j=1}^M EI\left({y}_j,{P}_{i,j}\right)\right] $$
(20)
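A minimal MATLAB sketch of the bi-objective formula (20) is given below; it reuses the EI sketch of Section 2.1.2 and the LUB sketch of Section 3.1, and the function name is ours.

```matlab
% Minimal sketch of the bi-objective HVEI formula (20). yhat and s are the 1-by-2
% Kriging predictions and uncertainties at a candidate x, U the (n+1)-by-2 local
% upper bounds and P the n-by-2 Pareto points.
function ehv = hvei_biobjective(yhat, s, U, P)
    ehv = 0;
    for i = 1:size(U, 1)                   % first term of (20): sum over the LUBs
        ehv = ehv + expected_improvement(yhat(1), s(1), U(i, 1)) ...
                  * expected_improvement(yhat(2), s(2), U(i, 2));
    end
    for i = 1:size(P, 1)                   % second term of (20): subtract the overlaps
        ehv = ehv - expected_improvement(yhat(1), s(1), P(i, 1)) ...
                  * expected_improvement(yhat(2), s(2), P(i, 2));
    end
end
```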

Unfortunately, (19) does not hold when M > 2; therefore, the HVEI still lacks a general calculation formula. However, if the first term of (18) is defined as an improvement function, it preserves all three properties above. This new function is named the modified hypervolume improvement and is given as

$$ {I}_{hv}^{mod}\left(\mathbf{y}\right)={\sum}_{i=1}^mV\left(\left[\mathbf{y},{\mathbf{U}}_i\right]\right) $$
(21)

and the three properties of \( {I}_{hv}^{mod} \) will be discussed in the following.
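Since (21) is a sum of box volumes, its evaluation is straightforward; a minimal MATLAB sketch (with our variable names) is given below.

```matlab
% Minimal sketch of the modified hypervolume improvement (21). y is a 1-by-M point
% and U the m-by-M matrix of local upper bounds; implicit expansion (R2016b+) is used.
function ihm = modified_hv_improvement(y, U)
    edges = max(U - y, 0);       % [y, U_i] is empty whenever y does not strongly dominate U_i
    ihm   = sum(prod(edges, 2)); % sum of the volumes of the boxes [y, U_i]
end
```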

Obviously, \( {I}_{hv}^{mod} \) satisfies Property 1 and Property 3 by its definition; see (21). In order to discuss Property 2 of \( {I}_{hv}^{mod} \), the difference between Ihv and \( {I}_{hv}^{mod} \) will first be analyzed by adopting the concept of grid cells. In the ith grid cell, the part dominated by y can be expressed as

$$ \left[\mathbf{y},{\mathbf{u}}_i\right]\bigcap \left[{\mathbf{l}}_i,{\mathbf{u}}_i\right]=\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right] $$
(22)

where max{y, li} = (max{y1, li, 1}, max{y2, li, 2}, ⋯, max{yM, li, M}). Then Ihv(y) can be expressed as the summation of the dominated regions over all the grid cells

$$ {I}_{hv}\left(\mathbf{y}\right)={\sum}_iV\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right) $$
(23)

The intersection of [y, Uj] and the ith grid cell can be written as

$$ \left[\mathbf{y},{\mathbf{U}}_j\right]\bigcap \left[{\mathbf{l}}_i,{\mathbf{u}}_i\right]=\left\{\begin{array}{c}\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right],\mathrm{if}\ {\mathbf{l}}_i\prec \prec {\mathbf{U}}_j\\ {}\varnothing, \mathrm{else}\ \end{array}\right. $$
(24)

Then, similar to (23), \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \) can be rewritten as

$$ {I}_{hv}^{mod}\left(\mathbf{y}\right)={\sum}_{j=1}^mV\left(\left[\mathbf{y},{\mathbf{U}}_j\right]\right)={\sum}_{j=1}^m{\sum}_{i:\ {\mathbf{l}}_i\prec \prec {\mathbf{U}}_j}V\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right)={\sum}_i{g}_iV\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right) $$
(25)

where gi denotes the number of LUBs strongly dominated by li. By comparing (23) and (25), we can see that when y dominates more LUBs, the difference between Ihv(y) and \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \) becomes more obvious. On the contrary, if y dominates only one LUB, \( {I}_{hv}\left(\mathbf{y}\right)={I}_{hv}^{mod}\left(\mathbf{y}\right) \). This implies that Ihv(y) and \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \) have a more similar landscape when y is closer to the LUBs.

When P ⊂ Ptrue, all the LUBs will be close to Ptrue, and \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \) will then have a similar ability to augment the Pareto frontier as Ihv(y) because of the similar landscape. Note that when P ⊂ Ptrue and M = 2, any point y ∈ Ptrue can strongly dominate at most one LUB; therefore \( {I}_{hv}\left(\mathbf{y}\right)={I}_{hv}^{mod}\left(\mathbf{y}\right) \). This means that \( {I}_{hv}^{mod} \) works the same as Ihv in augmenting the Pareto frontier for the bi-objective case.

Generally speaking, \( {I}_{hv}^{mod} \) performs better than Ihv on Property 3, and based on the discussion above, \( {I}_{hv}^{mod} \) can also produce well-distributed Pareto points (Property 1 and Property 2). The quality of the Pareto points obtained by \( {I}_{hv}^{mod} \) and Ihv will be compared with numerical tests in Section 4.1.

3.3 Optimization with MHVEI

Based on (17) and (21), \( {I}_{hv}^{mod}\left(\mathbf{y}\right)\ge 0 \), and \( {I}_{hv}^{mod}\left(\mathbf{y}\right)>0 \) if and only if y ∈ Ω. Then, similar to the HVEI, the expected value of \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \) is defined as the MHVEI:

$$ {EI}_{hv}^{mod}\left(\mathbf{x}\right)=E\left[{I}_{hv}^{mod}\left(\mathbf{y}\left(\mathbf{x}\right)\right)\right]={\sum}_{i=1}^mE\left[V\left(\left[\mathbf{y},{\mathbf{U}}_i\right]\right)\right]={\sum}_{i=1}^m\left[{\prod}_{j=1}^M EI\left({y}_j,{U}_{i,j}\right)\right] $$
(26)

where EI(yj, Ui, j) is the EI function, see (9).
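A minimal MATLAB sketch of (26) is given below; it reuses the EI sketch of Section 2.1.2, and the function name is ours, not part of the original implementation.

```matlab
% Minimal sketch of the MHVEI criterion (26): one-dimensional EI values multiplied
% over the M objectives and summed over the m local upper bounds. yhat and s are
% the 1-by-M Kriging predictions and uncertainties at a candidate x, U is m-by-M.
function mhvei = mhvei_criterion(yhat, s, U)
    [m, M] = size(U);
    mhvei = 0;
    for i = 1:m
        term = 1;
        for j = 1:M
            term = term * expected_improvement(yhat(j), s(j), U(i, j));  % EI(y_j, U_{i,j})
        end
        mhvei = mhvei + term;
    end
end
```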

The optimization with the MHVEI follows the typical process of the EGO method, as shown in Fig. 3. At the very beginning, the initial training data are usually generated with the Latin hypercube method (Mckay et al. 1979). Before the optimization, the initial LUBs are obtained based on the initial P and r, and they are updated with the new design y(x) in each iteration. An efficient algorithm for generating and updating the LUBs has been proposed by Przybylski et al. (2010) and further improved by Klamroth et al. (2015). In their tests, updating the LUBs with one new point takes no more than 10−5 seconds when m = 10000 and M = 6; therefore, this computational cost can be ignored. The Matlab code of Przybylski’s approach can be found in Appendix 1.

Fig. 3 Optimization process with MHVEI

3.4 Properties of MHVEI

In the following, the performance of MHVEI will be discussed with respect to several desired properties of a Pareto infill criterion. All these properties are summarized from the work of Wagner et al. (2010).

  • Global exploration and local exploitation

The statistical infill criterion should be monotonically decreasing with increasing \( {\widehat{y}}_i \) and monotonically increasing with increasing \( {\widehat{\sigma}}_i \) (i = 1, 2, ⋯, M). Obviously, this property is preserved in MHVEI, since it is monotonically increasing with increasing EI(yj, Ui, j).

In addition, the balance between global exploration and local exploitation is the main difference between HVEI and MHVEI. Substituting (23) and (25) into (15) and (26) separately, we get

$$ {EI}_{hv}\left(\mathbf{x}\right)={\int}_{\mathbf{y}\mathbf{\in}\boldsymbol{\Omega}}\left[{\sum}_iV\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right)\right]p\left(\mathbf{y}\right)d\mathbf{y}={\sum}_i{\int}_{\mathbf{y}\mathbf{\in}\boldsymbol{\Omega}}V\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right)p\left(\mathbf{y}\right)d\mathbf{y} $$
(27)
$$ {EI}_{hv}^{mod}\left(\mathbf{x}\right)={\int}_{\mathbf{y}\mathbf{\in}\boldsymbol{\Omega}}\left[{\sum}_i{g}_iV\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right)\right]p\left(\mathbf{y}\right)d\mathbf{y}={\sum}_i{g}_i{\int}_{\mathbf{y}\mathbf{\in}\boldsymbol{\Omega}}V\left(\left[\max \left\{\mathbf{y},{\mathbf{l}}_i\right\},{\mathbf{u}}_i\right]\right)p\left(\mathbf{y}\right)d\mathbf{y} $$
(28)

If gi is considered as a weight coefficient, \( {EI}_{hv}^{mod} \) is a weighted integral form of EIhv, and \( {EI}_{hv}^{mod} \) prefers the region which dominates more LUBs compared with EIhv. Usually, if y dominates more LUBs, it has a higher Ihv(y) but a smaller probability p(y). Then, based on the discussion above, we can conclude that, compared with HVEI, MHVEI prefers the design point with a higher improvement Ihv(y) but a smaller probability p(y). This difference is similar to the one between GEI functions (Schonlau 1998) with different orders; see Section 2.1. Therefore, we can also conclude that the balance between global exploration and local exploitation in MHVEI inclines more to global exploration than that in HVEI.

  • Improving the Pareto set distribution when \( \widehat{\boldsymbol{\upsigma}} \) is small

The HVEI shows an outstanding advantage on this property, because it is defined to directly maximize the hypervolume. When \( \widehat{\boldsymbol{\upsigma}} \) is small, the statistical infill criterion regresses to the improvement function. As discussed in Section 3.2, \( {I}_{hv}^{mod} \) has a very similar ability to produce well-distributed Pareto points as Ihv; therefore, MHVEI can show a similar performance to HVEI in improving the Pareto set distribution. In order to demonstrate this, the approach of Wagner et al. (2010) is adopted. Note that both MHVEI and HVEI are functions of \( \widehat{\mathbf{y}} \) and \( \widehat{\boldsymbol{\upsigma}} \). The ability of a statistical infill criterion to improve the distribution depends on its fitness landscape with respect to varying \( \widehat{\mathbf{y}} \). Therefore, the two criteria are compared as functions of varying \( \widehat{\mathbf{y}} \) with constant \( \widehat{\boldsymbol{\upsigma}} \) (\( \widehat{\boldsymbol{\upsigma}}=0,\ 0.1 \) and 0.2) on a bi-objective case in Fig. 4. For each \( \widehat{\boldsymbol{\upsigma}} \), a group of Pareto points with both convex and concave shapes is generated. It can be observed that, although the two criteria have very different scales, they have very similar fitness landscapes.

Fig. 4 Fitness landscape on bi-objective case

  • Easy to search

The infill criterion should be continuous and differentiable, and plateaus in its fitness landscape should be avoided, so that a direct optimizer (especially a gradient-based method) can easily reach the local optima of the infill criterion. Clearly, the MHVEI is continuous and differentiable, since it has an analytical expression. According to Wagner et al. (2010), the HVEI shows strong gradients towards its local optima, as can be checked from its fitness landscapes; therefore, the comparison in Fig. 4 also implies that the MHVEI preserves this property as well as HVEI does.

  • Easy to implement and efficient to calculate

The MHVEI is easy to implement, since the direct calculation formula has been given in (26). From (26), the computational effort of MHVEI is proportional to M ∗ m. Clearly, m = n + 1 for the bi-objective case, and m is tightly upper bounded by \( O\left({n}^{\left\lfloor M/2\right\rfloor}\right) \) when M > 2 (Klamroth et al. 2015). Therefore, the MHVEI has a computational complexity of \( O\left(M\ast {n}^{\left\lfloor M/2\right\rfloor}\right) \) in the worst case.

In the Kriging model, predicting \( {\widehat{\sigma}}_i \) involves a matrix–vector multiplication, and the computational complexity is \( O\left({N}^2\right) \), where N denotes the number of training data. Since P ⊂ Y, n ≤ N. Therefore, MHVEI has the same computational complexity as the Kriging model when M ≤ 5 (the computational complexity is \( O\left({n}^2\right) \) when M = 5).

The computational complexity of MHVEI is compared with the fastest algorithms (as far as we know) for HVEI in Table 1, where the algorithm for the bi-objective case has been developed in Section 3.2; see (20). The algorithms of HVEI for M = 3 and M > 3 were developed by Yang et al. (2017) and Couckuyt et al. (2014), respectively. Note that Couckuyt et al. (2014) did not provide the computational complexity of their algorithm, but it can be estimated as \( O\left(M\ast {q}^2\right) \) based on their calculation formula, where q denotes the number of disjoint cells covering the whole region Ω. Based on (17), it can be deduced that m is the smallest number of cells that can cover the whole Ω; therefore, q ≥ m, and the computational complexity is \( O\left(M\ast {n}^{2\left\lfloor M/2\right\rfloor}\right) \) in the best case.

Table 1 Comparison on computational complexity

Generally speaking, Table 1 shows that MHVEI has an obvious advantage in computational complexity compared with HVEI when M is large, but it is still more time-consuming than the other infill criteria which are not developed based on the hypervolume indicator.

4 Numerical tests

The performance of MHVEI will be investigated with mathematical test functions. The hypervolume indicator is used to measure the obtained Pareto solutions; see (13). All the programs used in this paper are coded in Matlab. The Kriging model is established with the DACE toolbox (Nielsen et al. 2002). The new design point is obtained by a hybrid strategy of the PSO algorithm and a gradient-based method. To be specific, the PSO is used first to obtain a solution, and then this solution and all the current Pareto designs are used as the initial guesses for the “fmincon” function provided by Matlab. The HSO algorithm (While et al. 2005) is adopted to calculate the hypervolume indicator.

Six test functions from previous literature are used in this paper. For the bi-objective case, the test functions ZDT1, ZDT2 and ZDT3 are adopted, which have Pareto frontiers of different shapes (convex, concave and discontinuous). For higher M, the test functions DTLZ1, DTLZ2 and DTLZ7 (Deb et al. 2001) are adopted, whose Pareto frontiers are linear, concave and discontinuous, respectively. The mathematical expressions of the test functions can be found in Appendix 2; as an example, ZDT1 is sketched below. Note that all the test functions with the same d share the same design variable space \( \mathbb{X}=\left\{\mathbf{x}|0\le {x}_i\le 1,i=1,2,\cdots d\right\} \). In order to average out the randomness caused by the initial samples, fifty groups of Latin hypercube samples are generated for each \( \mathbb{X} \), and the number of initial samples always equals 10 ∗ d. For all the test functions, it is set that d = M + 1.
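The standard ZDT1 definition (d design variables in [0, 1], with a convex Pareto frontier at g = 1) is shown below as a minimal MATLAB sketch; the full expressions of all six functions are given in Appendix 2.

```matlab
% Standard ZDT1 test function: x is a 1-by-d row vector with components in [0, 1],
% f returns the two sub-objectives to be minimized.
function f = zdt1(x)
    d  = numel(x);
    f1 = x(1);
    g  = 1 + 9 * sum(x(2:d)) / (d - 1);
    f  = [f1, g * (1 - sqrt(f1 / g))];
end
```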

4.1 Comparison between improvement functions

The definition of the improvement function is important for the Pareto statistical infill criteria, and three desired properties have been proposed in Section 3.2. In order to further investigate the first two properties (generating well-distributed Pareto points) of \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \), \( {I}_{hv}^{mod}\left(\mathbf{y}\right) \) and Ihv(y) are compared on the DTLZ test functions (M = 3 and 4) in this section. To be specific, both improvement functions are directly used to obtain the infill design point, that is,

$$ {\max}_{\mathbf{x}}\ {I}_{hv}^{mod}\left[\mathbf{y}\left(\mathbf{x}\right)\right] $$
(29)

and

$$ {\max}_{\mathbf{x}}\ {I}_{hv}\left[\mathbf{y}\left(\mathbf{x}\right)\right] $$
(30)

where y(x) denotes the true objective functions. Note that (29) and (30) can be understood as the MHVEI and HVEI functions with \( \widehat{\mathbf{y}}=\mathbf{y} \) and \( \widehat{\boldsymbol{\upsigma}}=0 \). The obtained hypervolume values are compared in Fig. 5, and Fig. 6 shows the difference (the hypervolume obtained by Ihv minus the one obtained by \( {I}_{hv}^{mod} \)).

Fig. 5 Comparison of improvement functions

Fig. 6 Hypervolume difference

The results show that Ihv is able to achieve a higher hypervolume than \( {I}_{hv}^{mod} \). This is understandable, since maximizing Ihv is exactly the same as maximizing the hypervolume. However, the two curves shown in Fig. 5 basically overlap with each other, and the advantage of Ihv shown in Fig. 6 is almost negligible. This comparison shows that \( {I}_{hv}^{mod} \) can produce Pareto points distributed as well as those obtained by Ihv, and it also implies that the MHVEI works similarly to HVEI in the absence of uncertainty.

In addition, Ihv lacks a calculation formula when M > 3, and its computational complexity also grows exponentially with M. Therefore, \( {I}_{hv}^{mod} \) can be used to replace Ihv in heuristic intelligence algorithms such as SMS-EMOA (Beumea and Emmerich 2007) when the computational cost becomes a matter of concern.

4.2 Comparison with other infill criteria

In this section, the MHVEI is compared with three existing infill criteria: HVEI, EI-ELU and LCB. The EI-ELU was developed by Keane (2006) and is defined as

$$ {EI}_{elu}\left(\mathbf{x}\right)=P\left[{I}_{hv}\left(\mathbf{y}\right)>0\right]\min \left\{\left\Vert {\mathbf{y}}^{\ast}-{\mathbf{P}}_i\right\Vert |1\le i\le n\right\} $$
(31)

where P[Ihv(y) > 0] denotes the multi-objective probability of improvement, and y∗ is the expected centroid of y in Ω. The LCB can be understood as the hypervolume improvement of a lower-confidence point, \( {I}_{hv}\left(\widehat{\mathbf{y}}-w\widehat{\boldsymbol{\upsigma}}\right) \), where w is the gain factor; a sketch of its evaluation is given below. In this work, w is set to the value suggested by Emmerich et al. (2006). All these criteria are easy to implement with a low computational cost for the bi-objective problem, but the computational complexity of both EI-ELU and LCB also grows exponentially with M. In consideration of the computational cost, the MHVEI is compared with all three criteria on the ZDT functions (M = 2), and it is compared only with HVEI on the DTLZ functions (M = 3, 4).
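As an illustration only, the LCB value can be evaluated as sketched below; the Monte Carlo helper mc_hv_improvement from Section 2.2 is reused (an exact hypervolume-improvement routine would normally be used instead), and the lower-confidence point is assumed to stay inside the reference box.

```matlab
% Minimal sketch of the LCB criterion: the hypervolume improvement of the
% lower-confidence point yhat - w*s, estimated here with the Monte Carlo helper.
lcb = @(yhat, s, w, P, r) mc_hv_improvement(yhat - w * s, P, r);
```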

4.2.1 Comparison on ZDT functions

The averaged hypervolumes obtained by MHVEI, HVEI, EI-ELU and LCB are compared in Fig. 7, and the results can be discussed in terms of the desired properties mentioned in Section 3.4.

Fig. 7 Comparison on ZDT functions

The EI-ELU performs worst among the four criteria in most cases, which can be attributed to the fact that the EI-ELU is not monotonically decreasing with increasing \( {\widehat{y}}_i \). Compared with HVEI and MHVEI, the LCB can achieve a similar hypervolume at the end of the optimization, but the hypervolume grows more slowly during the process. Note that the LCB balances global exploration and local exploitation roughly in a linear way for the purpose of reducing the computational cost (the LCB can be calculated faster than the HVEI). However, the two search preferences have a strong effect on both the convergence rate and the global search ability of the statistical infill criterion.

The HVEI and MHVEI perform similarly on the ZDT functions, and they will be further compared below.

4.2.2 Comparison on DTLZ functions

Four groups of DTLZ functions with different numbers of sub-objectives and design variables are adopted. The averaged hypervolume and the relative variance are compared in Figs. 8, 9, 10 and 11. Generally speaking, MHVEI performs no worse than HVEI in all cases, and the advantage of MHVEI is quite obvious on DTLZ1 and DTLZ2 with d > 4. The reason can be explained by the difference between the two infill criteria. When d is larger, the unknown sub-objectives are more difficult for the Kriging model to approximate. As demonstrated in Section 3.4, compared with HVEI, MHVEI prefers points with a higher \( \widehat{\boldsymbol{\upsigma}} \), which is more helpful for improving the accuracy of the surrogate.

Fig. 8 Comparison on DTLZ functions with M = 3 and d = 4

Fig. 9 Comparison on DTLZ functions with M = 3 and d = 5

Fig. 10 Comparison on DTLZ functions with M = 4 and d = 5

Fig. 11 Comparison on DTLZ functions with M = 4 and d = 6

4.3 Investigation on a higher number of objectives

Figure 12 shows the Pareto points obtained by MHVEI during the optimization process with the first (of fifty) group of initial samples. It can be observed that the Pareto points keep exploring Ptrue as the optimization continues. However, one can imagine that when M is larger, many more Pareto points will be needed to cover the higher-dimensional space (the dimension of Ptrue is M − 1 in our tests).

Fig. 12 Pareto points obtained by MHVEI in the first 200 iterations

In this section, the performance of MHVEI on many-objective problems (M ≥ 3) is investigated. In order to eliminate the effect of the dimension on the hypervolume indicator, the normalized hypervolume, defined as H(P)/H(Ptrue), is used as the metric to measure the Pareto points distribution. Only the test functions DTLZ1 and DTLZ2 are adopted here, because H(Ptrue) can be calculated analytically for an arbitrary value of M. The optimization processes in the first 200 iterations with M = 3, 4, ⋯, 7 and d = M + 1 are compared.

Note that H(P)/H(Ptrue) ≤ 1, and H(P)/H(Ptrue) = 1 if and only if P = Ptrue. Figure 13 shows that the algorithm works well at the end of the optimization when M is small (M ≤ 4 for DTLZ1, and M ≤ 6 for DTLZ2), but the hypervolume grows more slowly as M increases. This means that, for higher-dimensional problems (in both the design variable space and the objective space), more iterations will be needed to improve the distribution of the Pareto points. However, the computational cost might be unacceptable in real applications because of the time-consuming evaluations. We believe that the parallel EGO methods (Haftka et al. 2016), which generate multiple new points in one iteration, are even more desirable for Pareto optimization than for the single-objective problem.

Fig. 13 Normalized hypervolume obtained by MHVEI

5 Conclusion

The HVEI function has been proved to be an outstanding statistical infill criterion for the multi-objective EGO method. However, its high computational cost has become the bottleneck limiting its application to many-objective optimization. Aiming at this problem, a new statistical infill criterion named MHVEI has been developed. The theoretical study shows that the MHVEI maintains all the desired properties of HVEI but has a much lower computational complexity. The difference between HVEI and MHVEI is also studied theoretically, and it is shown that, compared with HVEI, the MHVEI inclines more to global exploration. The new criterion is compared with several existing ones on mathematical test functions. The results show that the MHVEI and HVEI perform better than the other criteria on bi-objective problems, and that MHVEI performs no worse than HVEI on lower-dimensional (design variable space) problems and better on higher-dimensional ones.

The results in Section 4.1 show that maximizing \( {I}_{hv}^{mod} \) can produce Pareto points distributed as well as those obtained by directly maximizing the hypervolume. Therefore, we believe that \( {I}_{hv}^{mod} \) can be used in heuristic intelligence algorithms for MOO, such as SMS-EMOA, when the computational cost becomes a matter of concern. In addition, the MHVEI is a statistical infill criterion which is based on the probability estimated by the Kriging surrogate in this work. In theory, this approach can be extended to other Gaussian process models, such as the stochastic radial basis functions.

Some problems have also been observed in our method. Because of the curse of dimensionality, when the number of design variables is higher, the surrogate needs more training data to properly approximate the objectives. Even if the accuracy of the surrogate is high enough, more points are still needed to explore the higher-dimensional true Pareto frontier. Therefore, we believe that multiple-point infill criteria for parallel computation are desirable in the future. However, one question needs to be considered first: should the multiple points explore the multiple modes of the infill criterion (like the approaches for the single-objective problem), or should they try to augment the obtained Pareto solutions in different regions? In addition, although the MHVEI has a much lower computational complexity than the existing algorithms of HVEI, it is still more time-consuming than the other criteria which are not developed directly to improve the hypervolume. Therefore, more efficient algorithms are still expected.