Introduction

A plethora of practical engineering problems involve multiple conflicting objectives that have to be optimized simultaneously. Solving such problems requires more effort than single-objective optimization, as they usually have many (possibly infinitely many) optimal solutions; these solutions compose the so-called Pareto optimal set.

To add further to the challenge, for many real-world optimization problems there is also an absence of algebraic objective or response function definitions. Examples are crash tests, chemical reactions, many laboratory experiments, etc. Therefore an important challenge in optimization practice is how to solve an optimization problem in the absence of an algebraic model of the system to be optimized. Such optimization problems are called black-box as the available information is just input–output data without prior knowledge of the characteristics or physics of the relationships involved.

Due to the lack of an analytical description of the objective functions, derivatives are unavailable and derivative-based optimization methods cannot be used. Moreover, in many practical applications, the objective functions (or the associated constraints) are very costly to evaluate and it is desirable to limit the number of evaluations. Consequently, for costly black-box multi-objective optimization problems, the main concern is to find the Pareto optimal set with as few function evaluations as possible. Traditional derivative-free methods based on direct search or gradient estimation via numerical differentiation are not usually viable as they require many more function evaluations than can be comfortably afforded.

A popular and successful approach for derivative-free optimization of costly black-box functions is to construct response surface models, known as surrogate models (or metamodels), that mimic the behavior of the real-world process as closely as possible while being less resource-demanding to evaluate. Surrogate-based optimization methods became popular a few decades ago even though they were proposed much earlier [20]. Among the various potential surrogate models, polynomial response surface models [3], kriging [36], and radial basis functions (RBF) [6] are widely used in solving costly black-box optimization problems.

In recent years, much attention has been devoted to developing multi-objective optimization methods (e.g., see [31, 41, 49]) that deal with real-world applications characterized as costly multi-objective black-box optimization problems by using surrogate models to replace the unknown objective functions. However, little attention has been focused so far on the impact of the initial sample on the performance of the developed algorithms. Every black-box optimization algorithm starts the optimization process with an initial sample, usually a very limited one in the case of expensive function evaluations. The initial sample provides some knowledge for the method to further investigate the decision space with the aim of finding the global optimum. When the evaluation of objective functions is costly, these evaluated points are usually fed to a surrogate model to predict the real response function values of unevaluated points. An inexpensive surrogate model is constructed based on an initial sample; the model is then used in a search for the next points to evaluate. This approach decreases the number of resource-consuming function evaluations, but suggests that the initial sample selected to build a surrogate model can strongly impact the efficiency of optimization. This consideration motivates our analysis of the sampling effect on the optimization search.

Sampling methods have been used for a wide range of purposes ranging from censuses and surveys [1, 48] to numerical and computational studies [5, 9] to experimental investigations in industry and science [11, 24, 29, 35]. In general, sampling methods can be used in two main types of studies: observational and experimental studies [4]. Observational studies aim to draw inferences about an entire space from a sample [34], whereas experimental studies aim to identify the cause–effect relationship between input and output variables through controlled experiments [40]. In the first case, sampling methods must provide a representative sample of the entire space; in the second case, sampling methods must provide a small informative sample selected from the set of feasible experiments (the decision space). It is the latter experimental scenario that is relevant when using sampling methods in an optimization context.

To our knowledge, there exist only a few studies investigating the impact of sampling methods in the context of single-objective surrogate-based optimization [26, 44] and multi-objective optimization [30]. With regard to the multi-objective field, Poles et al. [30] focused on evolutionary optimization algorithms. Evolutionary algorithms require thousands of function evaluations to achieve a good approximation of the Pareto optimal set; therefore, they are not suitable for costly optimization. Instead, we focus on methodologies that require hundreds of function evaluations. Indeed, this study aims to investigate the effect and importance of initial sampling techniques on methods suitable for costly multi-objective black-box optimization.

The remainder of this chapter is organized as follows. In section “Problem Description”, we recall the basic concepts related to black-box and multi-objective optimization. Widely used sampling methods and the concepts behind them are outlined in section “Sampling Methods”. In section “Experiments”, we present the experimental setup used in the study and illustrate the experimental results. Section “Conclusions” summarizes the results obtained, provides insights and suggestions for future research directions, and draws some final conclusions.

Problem Description

The multi-objective optimization problem comprises multiple objective functions which are to be minimized simultaneously. It can be expressed in the following form:

$$\displaystyle{ \min \boldsymbol{f}(\boldsymbol{x}) = \bigl(f_{1}(\boldsymbol{x}),\ldots,f_{m}(\boldsymbol{x})\bigr)^{T}\qquad \text{subject to}\ \boldsymbol{x} \in S, }$$
(1)

where \(S \subset \mathbb{R}^{d}\) is the feasible set and \(f_{i}: S \rightarrow \mathbb{R}\), \(i = 1,\ldots,m\) (m ≥ 2), are objective functions to be minimized simultaneously. All objective functions are represented by the vector-valued function \(\boldsymbol{f}: S \rightarrow \mathbb{R}^{m}\). A vector \(\boldsymbol{x} \in S\) is called a decision vector and a vector \(\boldsymbol{z} = \boldsymbol{f}(\boldsymbol{x}) \in \mathbb{R}^{m}\) is called an objective vector.

We assume that at least one of the functions \(f_{i}\) is “costly”; that is, its evaluation requires a significant amount of resources and no analytic expression is available. Therefore, problem (1) is called a costly multi-objective black-box optimization problem.

In multi-objective optimization, the objective functions \(f_{1},\ldots,f_{m}\) in (1) are typically conflicting. In that case, there does not exist a decision vector \(\bar{\boldsymbol{x}} \in S\) such that \(\bar{\boldsymbol{x}}\) minimizes \(f_{i}\) in S for all \(i = 1,\ldots,m\), but there exists a number (possibly infinite) of Pareto optimal solutions. In mathematical terms, a decision vector \(\bar{\boldsymbol{x}} \in S\) and its image \(\bar{\boldsymbol{z}} = \boldsymbol{f}(\bar{\boldsymbol{x}})\) are said to be Pareto optimal or nondominated if there does not exist a decision vector \(\boldsymbol{x} \in S\) such that \(f_{i}(\boldsymbol{x}) \leq f_{i}(\bar{\boldsymbol{x}})\) for all \(i = 1,\ldots,m\) and \(f_{j}(\boldsymbol{x}) < f_{j}(\bar{\boldsymbol{x}})\) for some \(j = 1,\ldots,m\). If such a decision vector \(\boldsymbol{x} \in S\) does exist, \(\bar{\boldsymbol{x}}\) and \(\bar{\boldsymbol{z}}\) are said to be dominated by \(\boldsymbol{x}\) and its image \(\boldsymbol{z} = \boldsymbol{f}(\boldsymbol{x})\), respectively. The Pareto optimal set in the objective space is also called the Pareto optimal front.
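For concreteness, the dominance test above translates directly into code. The following is a minimal sketch (function names are ours, minimization assumed), not part of the chapter's methodology:

```python
# Minimal sketch of Pareto dominance and nondominated filtering for
# minimization; function names are illustrative, not from the chapter.
import numpy as np

def dominates(z1, z2):
    """True if objective vector z1 dominates z2: no worse in every
    objective and strictly better in at least one."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    return bool(np.all(z1 <= z2) and np.any(z1 < z2))

def nondominated_indices(Z):
    """Indices of the nondominated rows of an (n, m) array of
    objective vectors."""
    Z = np.asarray(Z, dtype=float)
    return [i for i in range(len(Z))
            if not any(dominates(Z[j], Z[i]) for j in range(len(Z)) if j != i)]

# (1, 3), (2, 2), (3, 1) form a front; (2.5, 2.5) is dominated by (2, 2).
Z = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
print(nondominated_indices(Z))  # -> [0, 1, 2]
```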

Sampling Methods

Sampling methods for experimental studies have been attracting a great deal of attention since the 1800s and have resulted in a dedicated field of research known as Design of Experiments. Their importance directly relates to the efficient collection of informative data, allowing for the quick delivery of robust results. This translates into considerable savings that minimize costs and time related to both physical (real-world) and computer-based experimentation.

Many sampling methods assume that the unknown objective function can be approximated by a simple model (e.g., linear or quadratic) and recommend samples located on the boundary of the design space. This assumption can be safely made if some knowledge exists of the objective function or if the approximation occurs locally (i.e., in a relatively small sub-area of the decision space) [17]. In black-box optimization, no knowledge exists regarding the objective function and the entire decision space is typically searched. Therefore, sampling methods are required to provide samples that are spread out across the entire decision space. Two such classes of methods are space-filling methods and low-discrepancy sequences. Space-filling methods aim to generate widespread samples using a range of different criteria including equally spaced intervals and distance measures. On the other hand, low-discrepancy sequences use a measure of uniformity (discrepancy) that minimizes the difference between the percentage of points falling in a particular region on a unit cube and the percentage of volume occupied by this region. The main space-filling designs and low-discrepancy sequences are reviewed below and investigated in our computational study.

Simple Random Sampling

In simple random sampling (SRS), N decision vectors are randomly sampled from the decision space [39]. All decision vectors have the same probability of being chosen, and this constant chance of selection extends to pairs, triplets, and so on (e.g., any given pair of decision vectors has the same chance of selection as any other pair).
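As a minimal sketch (assuming a box-constrained decision space; the function name and interface are our own), SRS reduces to independent uniform draws per coordinate:

```python
# A minimal SRS sketch, assuming a box-constrained decision space
# [lower, upper]^d; names and interface are illustrative only.
import numpy as np

def simple_random_sample(n_points, lower, upper, seed=None):
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Independent uniform draws per coordinate give every point (and
    # every pair, triplet, ...) the same chance of selection.
    return rng.uniform(lower, upper, size=(n_points, len(lower)))

X = simple_random_sample(32, lower=[0.0, 0.0], upper=[1.0, 1.0], seed=42)
```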

SRS is among the most popular sampling methodologies thanks to its simplicity and low computational demand. One drawback of SRS relates to its vulnerability to sampling error; indeed, the randomness of its selection process may result in a sample that is not evenly spread throughout the entire decision space. This is particularly true for small samples in high-dimensional spaces, which often exhibit apparent clustering and poorly covered regions [37]. Systematic and stratified sampling techniques have been developed to overcome this issue and choose a “more representative” sample.

Latin Hypercube Sampling

Latin hypercube sampling (LHS) is a stratified sampling technique. LHS controls how random samples are generated from a given probability distribution (usually uniform). To generate a sample of N decision vectors, the domain of each decision variable is divided into N equally spaced and non-overlapping intervals; then, one value is selected at random from each such interval. Random permutation of the resulting values for all decision variables results in a random Latin hypercube sample.
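The construction just described can be sketched in a few lines. This is an illustrative implementation on the unit hypercube, not a reference one:

```python
# Illustrative LHS sketch on the unit hypercube [0, 1)^d: one uniform
# draw inside each of the N equal strata per dimension, then an
# independent random permutation of each column.
import numpy as np

def latin_hypercube(n_points, dim, seed=None):
    rng = np.random.default_rng(seed)
    # Stratified draws: stratum k contributes a value in [k/N, (k+1)/N).
    strata = (np.arange(n_points) + rng.random((dim, n_points))) / n_points
    # Shuffling each dimension independently yields a random Latin
    # hypercube: every stratum is used exactly once per dimension.
    for column in strata:
        rng.shuffle(column)
    return strata.T  # shape (n_points, dim)

X = latin_hypercube(6, 2, seed=0)  # cf. the 6-interval examples in Fig. 1
```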

LHS originated in 1979 for computer-based experiments in order to address the need for a better and more efficient coverage of the decision space [22]. The authors showed that LHS reduces the variance in their chosen application of Monte Carlo integration.

The main advantage of LHS over SRS derives from its one-dimensional projection property: a Latin hypercube sample projected onto any single dimension results in a set of evenly distributed points. Due to this property, LHS is the most commonly used stratified sampling technique in many areas of computer-based experiments. Despite this, different studies show that it is not always the best choice [43, 44]. Indeed, LHS does not guarantee a uniform coverage (i.e., a good spread) of the decision space, as (sometimes large) areas of the decision space might remain unexplored. Two such examples are reported in Fig. 1, showing an LHS in two dimensions with six intervals per decision variable. In both cases, there is a large area of the decision space that is not explored; therefore, if we use such a sample to develop a prediction model, then the prediction will be poor in those unexplored areas. In the worst-case scenario (Fig. 1b), LHS can generate a sample with two perfectly correlated decision variables; such a sample causes the effects of the two variables to be completely confounded. To overcome these limitations, LHS methods have been improved through the adoption of an additional criterion; such improvements resulted in the maximin LHS (M-LHS) and minimum correlation LHS (C-LHS) methods described later on.

Fig. 1 Two LHS configurations with two variables in six intervals: (a) a non-space-filling LHS; (b) a worst-case LHS

Maximin Sampling

Maximin sampling belongs to the class of distance-based sampling methods. Distance-based sampling methods make use of the Euclidean distance to prevent sampled points from clustering so close together that they over-represent some regions of the design space.

The aim of maximin sampling is to scatter points in the decision space such that the minimal pairwise distance between points is maximized. Let \(\boldsymbol{x}_{j} = (x_{j1},x_{j2},\ldots,x_{jd})\) and \(\boldsymbol{x}_{k} = (x_{k1},x_{k2},\ldots,x_{kd})\) be two different decision vectors in a sample D(N, S), where N is the sample size and S is the d-dimensional feasible set. The following mathematical problem must be solved:

$$\displaystyle{ \max \min \, s^{2}(\boldsymbol{x}_{j},\boldsymbol{x}_{k}) }$$
(2)

where

$$\displaystyle{ s^{2}(\boldsymbol{x}_{j},\boldsymbol{x}_{k}) = \sum_{i=1}^{d}\left(\frac{x_{ji} - x_{ki}}{U_{i} - L_{i}}\right)^{2} }$$

and \(\boldsymbol{x}_{j},\boldsymbol{x}_{k} \in D(N,S)\), \(j,k = 1,2,\ldots,N\) (\(j \neq k\)); \(U_{i}\) and \(L_{i}\) are the upper and lower limits of the i-th variable. Therefore, a maximin sample of size N has a minimum pairwise distance that is maximal compared to any other N-sized sample.
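Evaluating the criterion in (2) for a given sample is straightforward; a small sketch (our own helper, assuming box bounds) is:

```python
# Sketch of evaluating criterion (2) for a sample X of shape (N, d);
# helper name and interface are ours, not from the chapter.
import numpy as np
from scipy.spatial.distance import pdist

def maximin_criterion(X, lower, upper):
    """Smallest range-scaled squared pairwise distance s^2 within X;
    a maximin sample is one that maximizes this value."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    scaled = (np.asarray(X, float) - lower) / (upper - lower)
    return pdist(scaled, metric="sqeuclidean").min()
```

One can then, for example, keep the best of many random candidate samples under this criterion, which is essentially how the approximate variants below (and our experimental setup) proceed.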

Maximin sampling was first introduced by Johnson et al. [14]. It is among the best methods to obtain an even coverage of the decision space. However, it tends to prioritize decision vectors that are located near the boundary of the decision space. Also, despite computational efficiency in low dimensions, the method is very demanding in high dimensions. To overcome this issue, various approximate maximin sampling methods have been developed. Approximate methods use conventional nonlinear programming algorithms to reduce the computational cost of the procedure at the expense of potentially providing solutions that are not globally optimal, only locally optimal.

Here, we propose a simple approximation method we call “Nearly maximin”, consisting of the following steps (a sketch in code follows the list):

Step 1: Randomly generate a decision vector \(\boldsymbol{x}_{1}\) from the decision space and choose it as the first element of the sample.

Step 2: Randomly generate n decision vectors from the decision space and, for each vector, calculate the Euclidean distance to the closest element of the existing sample.

Step 3: Choose the vector having the maximum distance out of the n decision vectors as the next element of the sample.

Repeat Steps 2 and 3 until the sample comprises N decision vectors.
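A minimal sketch of this procedure, under the assumption of a box-constrained decision space (the candidate count n is a tuning parameter: larger n gives samples closer to true maximin at higher cost):

```python
# Sketch of the "Nearly maximin" procedure above; assumes a
# box-constrained decision space [lower, upper]^d.
import numpy as np

def nearly_maximin(n_points, lower, upper, n_candidates=100, seed=None):
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    d = len(lower)
    # Step 1: a random first element.
    sample = [rng.uniform(lower, upper)]
    while len(sample) < n_points:
        # Step 2: n candidates and, for each, the distance to the
        # closest element already in the sample.
        cand = rng.uniform(lower, upper, size=(n_candidates, d))
        diffs = cand[:, None, :] - np.asarray(sample)[None, :, :]
        dists = np.min(np.linalg.norm(diffs, axis=2), axis=1)
        # Step 3: keep the candidate farthest from the current sample.
        sample.append(cand[np.argmax(dists)])
    return np.asarray(sample)

X = nearly_maximin(32, lower=[0.0, 0.0], upper=[1.0, 1.0], seed=0)
```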

In a preliminary study that will be published elsewhere, the Nearly maximin method showed extremely promising results. It will be used in our computational study to allow for the investigation of high-dimensional test problems.

Maximin LHS

Both LHS and maximin sampling produce samples with attractive properties. LHS guarantees that the one-dimensional projection of the sample presents an even spread in the variables’ domains; maximin guarantees that no two elements (decision vectors) in the sample are close together. However, both methods suffer from limitations. In particular, LHS might occasionally generate samples with points that are close to each other as in the examples of Fig. 1, whereas maximin tends to select samples that are located near the boundary of the decision space.

To overcome these limitations, Morris and Mitchell [25] suggested that LHS be combined with the maximin criterion. The resulting method is known as Maximin LHS (M-LHS). It consists of (a) generating as many candidate LHS samples as possible, (b) measuring their maximin distances, and (c) selecting the most evenly spread (optimal) sample.

M-LHS preserves the one-dimensional projection property of LHS while ensuring that no two points in the LHS design are very close to each other. Therefore, a good spread of decision vectors is achieved not just in each single variable domain but also in the entire decision space. Also, the decision vectors in the sample are preferentially located in the interior of the decision space, thus providing a compromise between the maximin property and good projective properties in each dimension (as guaranteed by Latin hypercubes) [25]. Unfortunately, constructing samples by M-LHS can be quite time consuming when the number of dimensions and design points increases. Indeed, there exist \((N!)^{d-1}\) LHS samples for N divisions and d dimensions; for each such sample, the maximin distances need to be calculated in order to identify the optimal one.

Correlation LHS

To find an optimal LHS, Iman and Conover [12], Owen [28], and Tang [42] proposed using a criterion that minimizes the correlation between factors. This is useful in applications requiring a sample composed of decision vectors without (or with small) correlation. Owen proposed measuring the goodness of an LHS with respect to a criterion of minimum pairwise correlations, which is defined as follows:

$$\displaystyle{ \rho^{2} = \frac{\sum_{i=2}^{d}\sum_{j=1}^{i-1}\rho_{ij}^{2}}{d(d - 1)/2}, }$$
(3)

where \(\rho_{ij}\) is the pairwise correlation between columns i and j of the design, so that \(\rho_{ij}^{2} \in [0,1]\). The smaller \(\rho^{2}\) is, the weaker the pairwise correlation is.

In the C-LHS method suggested by Owen [28], the sum of between-column squared correlations is decreased by alternating forward and backward Gram–Schmidt orthogonalization. In our computational study, we used the Matlab implementation of Owen’s method.

One might think that minimizing the correlation should spread out the points and that maximizing the distance between the points should reduce the correlation. In practice, however, there is no one-to-one relationship between the two, and designs obtained by these two criteria can be quite different [15]. In other words, C-LHS does not necessarily provide a well-spread sample.
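As an illustration (not the chapter's Matlab code), the criterion (3) can be computed from a correlation matrix, and a crude C-LHS approximation keeps the lowest-\(\rho^{2}\) sample among many random LHS draws, mirroring the best-of-1000 selection used later in our experiments:

```python
# Sketch: the pairwise-correlation criterion (3) and a brute-force
# C-LHS approximation (best of many random LHS draws). Illustrative
# only; not the Gram-Schmidt procedure of Owen [28].
import numpy as np
from scipy.stats import qmc

def rho_squared(X):
    """Mean squared pairwise correlation between the columns of X."""
    d = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices(d, k=1)
    return np.sum(corr[iu] ** 2) / (d * (d - 1) / 2)

def best_correlation_lhs(n_points, dim, n_trials=1000, seed=None):
    sampler = qmc.LatinHypercube(d=dim, seed=seed)
    draws = (sampler.random(n_points) for _ in range(n_trials))
    return min(draws, key=rho_squared)

X = best_correlation_lhs(32, 2, seed=0)
```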

Halton Sequence Sampling

The Halton sequence sampling method generates quasi-random numbers of high dimensionality with a high level of uniformity across the space. A Halton sequence is constructed by a deterministic method that uses a different prime base for each dimension to create a d-dimensional low-discrepancy sequence [7, 19, 21]. The method is based on the fact that each non-negative integer can be expanded in a prime base. The construction of a Halton sequence in d-dimensional space is as follows:

  1. Choose d prime integers \(p_{1},p_{2},\ldots,p_{d}\) (usually the first primes \(p_{1} = 2\), \(p_{2} = 3,\ldots\) are selected).

  2. To generate the i-th sample, consider the base-p representation of i:

     $$\displaystyle{ i = a_{0} + a_{1}p + a_{2}p^{2} + a_{3}p^{3} + \ldots }$$

     where each \(a_{j}\) is an integer in [0, p − 1].

  3. The next point in [0, 1] is obtained by reversing the order of the digits and moving the radix point:

     $$\displaystyle{ r(i,p) = \frac{a_{0}}{p} + \frac{a_{1}}{p^{2}} + \frac{a_{2}}{p^{3}} + \frac{a_{3}}{p^{4}} + \ldots }$$

  4. Starting from i = 0, the i-th sample in the Halton sequence is

     $$\displaystyle{ (r(i,p_{1}),r(i,p_{2}),\ldots,r(i,p_{d})) }$$
     (4)

The Halton sequence is an extension of the Van der Corput sequence, which was originally introduced for one dimension with base 2; the Van der Corput sequence is recovered by setting p = 2. However, Halton sequences based on large primes (dimensions d > 10) can be highly correlated, and their coverage can be worse than that of pseudo-random uniform sequences.
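Steps 1–4 above amount to a radical-inverse computation per dimension; a minimal sketch (with the first few primes hard-coded for brevity) is:

```python
# Minimal sketch of steps 1-4: the radical inverse r(i, p) and a
# d-dimensional Halton point set over the first d primes (hard-coded
# here for brevity; a prime sieve would be used in general).
import numpy as np

def radical_inverse(i, p):
    """Reverse the base-p digits of i about the radix point."""
    r, f = 0.0, 1.0 / p
    while i > 0:
        i, a = divmod(i, p)  # peel off the least significant digit a
        r += a * f
        f /= p
    return r

def halton(n_points, dim):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:dim]
    return np.array([[radical_inverse(i, p) for p in primes]
                     for i in range(n_points)])

X = halton(32, 2)  # 32 well-spread points in the unit square
```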

Hammersley Sequence Sampling

The Hammersley sequence [8, 21] belongs to the class of low-discrepancy sequences and is closely related to the Fibonacci series. The Hammersley sequence is an adaptation of the Halton sequence (4) for when the required sample size N is known in advance. In that case, a more uniformly distributed sample can be obtained by using only d − 1 distinct primes. In a Hammersley sequence with N elements and starting from i = 0, the i-th d-dimensional vector is

$$\displaystyle{ \left(\frac{i}{N},\,r(i,p_{1}),r(i,p_{2}),\ldots,r(i,p_{d-1})\right)\quad \text{for } i = 0,1,2,\ldots,N-1. }$$
(5)

Hammersley sequence sampling provides better uniformity properties than LHS [23]; in particular, the chance of generating samples with clustered decision vectors is lower. Also, empirical studies indicate that, compared to other conventional techniques, Hammersley sampling requires far smaller samples to approximate the mean and variance of distributions [16].
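A sketch of (5), reusing the radical inverse from the Halton sketch (repeated here so the snippet stands alone):

```python
# Minimal sketch of (5): an equispaced first coordinate i/N, with the
# radical inverse for the remaining d-1 coordinates.
import numpy as np

def radical_inverse(i, p):
    r, f = 0.0, 1.0 / p
    while i > 0:
        i, a = divmod(i, p)
        r += a * f
        f /= p
    return r

def hammersley(n_points, dim, primes=(2, 3, 5, 7, 11, 13)):
    return np.array([[i / n_points]
                     + [radical_inverse(i, p) for p in primes[:dim - 1]]
                     for i in range(n_points)])

X = hammersley(32, 2)  # 32 points in the unit square
```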

Sobol Sequence Sampling

Sobol sequence sampling is an improved version of the Halton and Hammersley methods. Although the Halton and Hammersley methods are relatively simple and efficient, they suffer from a common pitfall: their performance degrades substantially in higher dimensions. Sobol sequences were proposed to approximate the integral over the d-dimensional unit cube:

$$\displaystyle{ \lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=0}^{n-1} f(x_{i}) = \int_{[0,1]^{d}}\! f(x)\,\mathrm{d}x }$$

where f is a real integrable function over the d-dimensional unit hypercube and \(x_{0},\ldots,x_{n-1}\) are n points in \([0,1]^{d}\) comprising a “Sobol sequence.” The Sobol sequence, as originally defined by Sobol [38], is generated from a set of special binary vectors of length w bits, \(v_{i}^{j}\), \(i = 1,2,\ldots,w\), \(j = 1,2,\ldots,d\). These numbers, \(v_{i}^{j}\), are called direction numbers. To generate them for dimension j, one begins with a primitive polynomial over the finite field \(\mathcal{F}_{2}\) with elements {0, 1}. Let us assume that the primitive polynomial is

$$\displaystyle{ p_{j}(x) = x^{q} + a_{1}x^{q-1} + \ldots + a_{q-1}x + 1. }$$

Then we use its coefficients to define the following q-term recurrence relation for calculating the direction numbers \(v_{i}^{j}\) in dimension j:

$$\displaystyle{ v_{i}^{j} = a_{1}v_{i-1}^{j} \oplus a_{2}v_{i-2}^{j} \oplus \ldots \oplus a_{q-1}v_{i-q+1}^{j} \oplus v_{i-q}^{j} \oplus (v_{i-q}^{j}/2^{q}), }$$

where i > q, ⊕ denotes the bitwise XOR operation, and the last term is \(v_{i-q}^{j}\) shifted right by q places. The initial numbers \(v_{1}^{j} \cdot 2^{w}, v_{2}^{j} \cdot 2^{w},\ldots,v_{q}^{j} \cdot 2^{w}\) can be arbitrary odd integers smaller than \(2, 2^{2},\ldots,2^{q}\), respectively. The Sobol sequence \(x_{n}^{j}\) (with \(n = \sum_{i=0}^{w} b_{i}2^{i}\), \(b_{i} \in \{0,1\}\)) in dimension j is generated by

$$\displaystyle{ x_{n}^{j} = b_{1}v_{1}^{j} \oplus b_{2}v_{2}^{j} \oplus \ldots \oplus b_{w}v_{w}^{j}. }$$

A different primitive polynomial should be used to generate the Sobol sequence in each dimension. More efficient ways of generating Sobol sequences have since been proposed in the literature (see, e.g., [27]).
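In practice, implementing direction numbers by hand is error-prone; a practical sketch instead uses scipy.stats.qmc (available in SciPy 1.7 and later), which provides Sobol, Halton, and Latin hypercube generators:

```python
# A practical sketch using scipy.stats.qmc instead of hand-coded
# direction numbers; scrambling gives run-to-run randomness.
from scipy.stats import qmc

sampler = qmc.Sobol(d=2, scramble=True, seed=42)
X = sampler.random_base2(m=5)  # 2**5 = 32 points in [0, 1)^2
# Map the unit-cube sample onto the feasible box of the problem.
X = qmc.scale(X, l_bounds=[0.0, 0.0], u_bounds=[10.0, 5.0])
```

Scrambling plays a role analogous to the random-start trick we apply to the deterministic sequences in the experimental setup below.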

Summary of Sampling Methods

This section outlines the main characteristics of sampling methods discussed above. Table 1 summarizes the sampling methods in terms of their main features. Samples consisting of 32 points in two-dimensional space, and generated by different sampling methods, are presented in Fig. 2 for a visual comparison.

Table 1 Characteristics of sampling methods
Fig. 2 Initial sample in 2D, number of points = 32

An important consideration relates to the methods’ computational cost in the context of costly optimization. For those methods that demand moderate to intensive computational effort, it is important to investigate the compromise between (a) the time required to generate the sample and (b) the sample quality, i.e., the sample’s ability to decrease the number of further function evaluations without affecting the results of the optimization procedure. Obviously, if function evaluations are highly costly and involve resources other than time, even a small decrease in the number of function evaluations justifies the higher computational time required to generate the optimal sample.

Experiments

The impact of the sampling method on the efficiency of multi-objective optimization algorithms is evaluated by means of a comprehensive benchmark problem set. The design of the experimental study is described in section “Experimental Setup”. Section “Test Problems” gives an overview of the benchmark problems. The major part of this section is devoted to a discussion of the obtained results and the corresponding observations; this is covered in section “Results”.

Optimization Algorithms Considered

In our study, we considered three algorithms designed for costly multi-objective optimization problems, namely ParEGO, SMS-EGO, and ε-EGO. These algorithms were selected due to their available implementation in the R package mlrMBO [2].

ParEGO is a state-of-the-art algorithm developed by Knowles [18]. It uses the augmented Tchebycheff norm to convert a multi-objective problem into a scalarized one:

$$\displaystyle{ f_{\lambda}(\boldsymbol{x}) = \max_{j=1,\ldots,m}\big(\lambda_{j}f_{j}(\boldsymbol{x})\big) + \rho\sum_{j=1}^{m}\lambda_{j}f_{j}(\boldsymbol{x}), }$$
(6)

where ρ > 0 is a small positive number and λ is a weight vector. At each iteration, ParEGO draws a different weight vector λ uniformly at random from a set of evenly distributed vectors and fits a surrogate model to the corresponding scalarized problem; this allows the model to gradually build up an approximation to the true Pareto set. Before scalarization, the objective functions are normalized to the range [0, 1] with respect to the known (or estimated) limits of the objective space. At each iteration, the method uses a genetic algorithm to search for the solutions that maximize an infill criterion, called expected improvement, with respect to the surrogate model. Only the best solution is evaluated on the actual problem. After evaluation of the selected solution on the real expensive function, ParEGO updates the Gaussian process (GP) surrogate model of the landscape and repeats the same steps.
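A minimal sketch of the scalarization (6) follows (not the mlrMBO implementation; objectives assumed already normalized to [0, 1], and ρ = 0.05 is an assumed small constant):

```python
# Minimal sketch of the augmented Tchebycheff scalarization (6);
# illustrative only. Objectives assumed normalized to [0, 1]; the
# default rho = 0.05 is an assumption, not taken from the chapter.
import numpy as np

def augmented_tchebycheff(f_vals, weights, rho=0.05):
    """Scalarize m normalized objective values under a weight vector."""
    f_vals, weights = np.asarray(f_vals), np.asarray(weights)
    weighted = weights * f_vals
    return np.max(weighted) + rho * np.sum(weighted)

# A fresh weight vector, summing to one, is drawn at each iteration.
w = np.array([0.3, 0.7])
print(augmented_tchebycheff([0.2, 0.9], w))  # scalar value to model
```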

The other two algorithms do not convert the multi-objective optimization problem into a single-objective one; instead, they perform a multi-objective optimization of infill criteria on each objective in order to obtain a candidate set for evaluation. SMS-EGO [31] optimizes the hypervolume, while ε-EGO [45] searches for solutions with respect to the additive ε-indicator introduced by Zitzler et al. [51]. The additive ε-indicator of an approximation set A gives the minimum value ε by which each point in the real front R can be translated such that the resulting transformed front is dominated by A.

Test Problems

The test set consists of different benchmark problems with a variety of characteristics in both the decision and objective spaces. The objectives of the test problems can be either unimodal (U) or multimodal (M). Multimodal problems are more difficult than unimodal problems and more representative of real-world problems. The Pareto optimal front can be convex, linear, concave, disconnected, or some combination of these. It is well known that the type of Pareto front can directly affect the performance of optimization algorithms. For example, disconnected Pareto fronts can increase the likelihood that an algorithm will fail to find all regions of the Pareto optimal front. The fitness landscape may be one-to-one or many-to-one, and the latter property impacts some algorithms’ ability to find multiple, otherwise equivalent optima. For a more detailed discussion of test problem properties, we refer readers to [10].

Our test set includes the following benchmark problems:

  • OKA2: m = 2, d = 3. The true Pareto optimal set for this problem is a spiral-shaped curve in the decision space, and the density of the Pareto optimal solutions in the objective space is low.

  • Kursawe: This problem has a scalable number of decision variables; in our experiment we used d = 3 and m = 2. Its Pareto optimal set is disconnected and symmetric in the decision space, and disconnected and concave in the objective space.

  • Viennet: m = 3, d = 2. The true Pareto optimal set is convex in the objective space.

  • ZDT family: ZDT problems share such characteristics as multimodality, discontinuity, and possession of multiple Pareto fronts; for all problems, m = 2 and d is scalable; however, we used the d values suggested by the authors.

    • ZDT1: d = 30; Pareto optimal set in the objective space is convex.

    • ZDT2: d = 30; Pareto optimal set in the objective space is nonconvex.

    • ZDT3: d = 30; the Pareto optimal set is disconnected in both the objective and decision spaces, consisting of one mixed convex/concave component and several convex components in the objective space.

    • ZDT4: d = 10; the first objective function is unimodal, while the second has multiple local optima and is therefore highly multimodal. Its Pareto optimal set in the objective space is convex [10].

    • ZDT6: d = 10; it has a nonuniform search space, i.e., the Pareto optimal solutions in the decision space are non-uniformly distributed along the global Pareto set, and the density of solutions is lowest near the Pareto optimal set and highest away from it. The Pareto optimal set in the objective space is concave.

  • DTLZ1: a problem scalable in both the objective and decision spaces, with multiple global optima. Thus, the only difficulty posed by this problem is convergence to the Pareto optimal hyperplane. We solved three sizes of this problem: (1) m = 4 and d = 13; (2) m = 6 and d = 15; and (3) m = 8 and d = 17.

The major characteristics of the selected benchmark problems are summarized in Table 2.

Table 2 Summary of test problems characteristics

Performance Assessment

In multi-objective optimization, the definition of solution quality is substantially more complex than for single-objective problems, as the optimization goal itself consists of several objectives: convergence to the true Pareto frontier, uniform distribution of the obtained nondominated solutions, and maximum extent of the obtained nondominated set with respect to each objective. Therefore, a number of quality metrics, each usually taking into account one solution quality characteristic, have been proposed (see, e.g., [13, 47]). The most widely used performance metric is the hypervolume (HV) indicator (also known as the S-metric) [50], which measures the size of the region of the objective space dominated by the relevant Pareto set approximation and bounded by a reference point. As such, it provides information about closeness and diversity at the same time. In addition, it possesses a desirable property: whenever one approximation completely dominates another, the HV of the former will be greater than the HV of the latter [52]. In our study, we calculated the HV metric using normalized values of the objective functions.
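For the bi-objective case, the HV indicator reduces to a sum of rectangle areas; a minimal sketch (our own helper, assuming a nondominated input front and minimization) is:

```python
# Minimal sketch (our own helper) of the HV indicator for m = 2 under
# minimization: assumes `front` is already nondominated and that every
# point is dominated by the reference point `ref`.
def hypervolume_2d(front, ref):
    """Area dominated by a 2-D nondominated front, bounded by ref."""
    pts = sorted(map(tuple, front))  # ascending f1 => descending f2
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # strip added by this point
        prev_f2 = f2
    return hv

front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(4.0, 4.0)))  # -> 6.0
```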

Experimental Setup

In this study we control: (a) the size of the initial sample, (b) the optimization budget, (c) the dimension of decision space, and (d) the dimension of objective space.

The initial design size was set to \(n_{\text{init}} = 11d - 1\) based on the recommendations in [18]. An example of the initial samples generated by different sampling methods for two decision variables is given in Fig. 2. The number of optimization iterations was restricted to 200, resulting in a total budget of \(n_{\text{total}} = 200 + 11d - 1\) evaluations. The algorithms were evaluated on the 11 test problems, of varying dimensions, discussed in section “Test Problems”.

The Pareto front approximations of the algorithms were compared with respect to the HV metric not only at the last iteration (n = 200) but also at intermediate iterations (n = 50, 100, and 150). For each test function, the reference point was estimated based on the nondominated set of initial samples.

With regard to the sampling methods, it is important to point out the following aspects. It can be computationally very expensive to obtain optimal maximin and low-correlation C-LHS samples; therefore, we chose the best sample out of 1000 randomly generated LHS samples with regard to the corresponding criterion (maximin or minimum correlation). The low-discrepancy sampling methods (i.e., Hammersley, Halton, and Sobol sequences) are deterministic (rather than stochastic), as there is no run-to-run difference between generated samples; therefore, we used the “random-start sequence” trick [46]. By defining random starts for the generation of these samples, the sequences differ in each run, introducing stochasticity into the samples.

Results

This section provides a detailed analysis of the results of the experiments performed. In each numerical experiment, an initial sample was generated by running one sampling method 100 times. An optimization search was then performed by each algorithm over these runs, and performance was estimated with respect to the HV metric.

The results of each sampling method were compared to those of the other seven methods to determine whether its results had a statistically significant advantage. Comparisons were performed using the unpaired t-test [32], and differences were deemed statistically significant at the 0.01 significance level. The significance was tested multiple times, i.e., after 50, 100, 150, and 200 function evaluations.

The average performance of the optimization algorithms with different sampling methods on different test problems is represented in Figs. 3, 4, 5, 6, and 7. However, in real-world applications, knowing the average performance is not enough, and the worst-case scenarios are usually taken into account as well. The worst-case scenario provides additional information but is sometimes considered too conservative. Instead of the worst-case scenario, we may want to know the mean of the realizations above a specified quantile, i.e., the conditional value-at-risk (CVaR) introduced in [33]. We calculated CVaR with a confidence level of 0.05, giving the average value over a distribution tail consisting of the 5% worst realizations. Due to space limitations, we could not provide graphs for all the optimization algorithms and test problems considered; therefore, we have selected the ones providing most of the information and supporting the main observations.

Fig. 3 ParEGO performance on OKA2 problem

Fig. 4 SMS-EGO performance on Kursawe problem

Fig. 5 ε-EGO performance on ZDT1 problem

Fig. 6 SMS-EGO performance on ZDT2 problem

Fig. 7 ParEGO performance on DTLZ1 problem with 8 objectives and 17 decision variables

We noticed that the largest variability over 100 runs is produced by the Hammersley, Sobol, and Halton sequences, which are by nature deterministic methods. In order to generate 100 distinct runs, we used Matlab functions with various leap and skip parameter values, which produce different subsets of these sequences, sometimes not fully covering the whole decision space. To our knowledge, there is no recommendation on how to select these parameters. Therefore, the average performance of the deterministic sampling methods is influenced by some initial samples not being spread throughout the entire space.

According to the obtained results regarding the average performance of the optimization algorithms based on the normalized HV metric, we can observe some trends. Generally, sampling methods do not affect optimization algorithm performance significantly (the difference in HV metric values over 100 runs is not statistically significant at the 1% level) on problems whose objective and decision space dimensions are both at most three. We also found that algorithm performance is very similar when using samples generated by the stochastic sampling methods (namely, SRS, LHS, M-LHS, and C-LHS) on bi-objective problems with a high-dimensional decision space, with no statistically significant difference among them, as illustrated in Figs. 5 and 6. ParEGO with initial samples generated by LHS did not demonstrate the best performance on any single problem, while the improved LHS variants (i.e., M-LHS and C-LHS) showed very good performance on bi-objective problems with a larger number of decision variables. Hammersley sequence sampling with ParEGO showed the best performance, or close to it, on unimodal bi-objective problems of low dimensionality with a continuous Pareto front. Halton sequence sampling in conjunction with any of the considered optimization algorithms performed poorly in most cases, especially with a larger number of decision variables, except on low-dimensional problems with a convex Pareto front; it was also outperformed by the other two deterministic techniques quite a number of times. The M-LHS sampling technique paired with SMS-EGO behaved well on problems with a larger number of decision variables but was outperformed by other sampling techniques on smaller problems. All optimization algorithms demonstrated better average performance on problems with more than three objectives when using maximin for initial sampling (see, e.g., Fig. 7). The SRS method can be considered an appropriate choice because the performance of the selected optimization algorithms was in most cases not significantly worse than with other stochastic sampling methods. There is one exception, however, for the Kursawe problem optimized with the SMS-EGO algorithm (see Fig. 4), where SRS initial samples led to statistically significantly worse performance compared to the Hammersley and Halton sequences as well as the maximin sampling method.

Conclusions

This section draws some conclusions and provides some recommendations based on the experiments performed and the results obtained. In addition, we shed some light on the impact of the selected sampling techniques on costly black-box multi-objective optimization, and discuss some future research questions which we leave for further investigation.

To summarize, sampling methods have no statistically significant impact (at the 0.01 significance level) on algorithm performance measured by the HV metric for low-dimensional problems, i.e., m, d ≤ 3. Therefore, SRS can be considered an appropriate choice for low-dimensional problems.

Also, when using deterministic sampling methods, one has to check that the initial sample is a good representative sample in the sense of covering the entire decision space. Otherwise, a “bad” initial sample can cause the optimization outcome to deteriorate significantly; i.e., the variance of the deterministic methods is larger than that of the stochastic sampling methods. Although LHS is often used as a default sampling method in multi-objective optimization, the obtained results did not confirm that it outperforms other sampling methods; one could use M-LHS or C-LHS instead, as these sampling methods obtain better results in many cases. For problems that are high-dimensional in both the objective and decision spaces, the deterministic methods led to large variability, which resulted in significantly lower average algorithm performance compared to the stochastic sampling methods. In particular, the maximin sampling method outperformed the other stochastic methods, though this advantage was not statistically significant.

We plan to continue this research on a larger set of test problems with larger numbers of both objectives and decision variables, possessing a variety of properties. Hopefully, this will provide greater insights and enable us to determine more concrete recommendations. Our conclusion for now is that the choice of initial sample matters in higher dimensions. In this work, we have studied algorithm performance with respect to the most widely used performance metric, HV. It would be interesting to investigate the impact of sampling methods with respect to other metrics. Moreover, the question of what initial sample size one should use and how it affects the optimization results is also open. Clearly, there is a trade-off between the size of the initial sample and the number of evaluations used to run an optimization algorithm when dealing with costly real-world optimization problems. This research direction will therefore also be considered in the future.