1 Introduction

Problems of global optimization of continuous multimodal functions are attacked using deterministic [1,2,3,4,5,6,7], stochastic [8,9,10,11,12], and heuristic approaches [13,14,15]. A promising research direction is the hybridization of the advantages of some of the successful methods [16, 17]. Although the proposed algorithms are thoroughly investigated, there remains an important issue that has received relatively little attention, namely the convergence rate problem. In the present paper we focus on this problem, continuing the research in [18,19,20,21], where the convergence rate was established for algorithms rooted in the approach based on statistical models of global optimization. The convergence rate of global optimization algorithms in a stochastic setting has been considered for certain Gaussian probabilities; see for example [19]. The general question of average-case tractability of multivariate global optimization is mentioned (Open Problem 87) in [22].

We consider the problem of approximating the global minimum \(f^*\) of a twice-continuously differentiable function f defined on the d-dimensional unit cube \(A=[0,1]^d\), for \(d\ge 2\). We assume that the global minimizer, denoted by \(x^*\), is unique and lies in the interior of the cube. While we assume differentiability, we are interested in algorithms that only evaluate the function and not derivatives. Our interest centers on how the number of function values required to obtain a given quality approximation grows with the dimension of the domain.

The algorithm presented here is similar to a one-dimensional algorithm presented in [20] and a two-dimensional algorithm based on Delaunay triangulations presented in [21]. Those papers presented algorithms with the property that for large enough N (depending on f), the residual error (difference between the smallest function value observed and the global minimum) after N function evaluations is at most

$$\begin{aligned} c_1 \exp (-\,c_2\sqrt{N}) \end{aligned}$$

for \(c_1,c_2\) depending on dimension (1 or 2) and f. While the algorithm presented in [21] was based on a Delaunay triangulation of the domain, the algorithm considered in this paper is based on a rectangular decomposition. Rectangular decompositions have been employed in previous studies of global optimization, for example in [4, 23,24,25]. A rectangular partition of the feasible region is also inherent in interval arithmetic-based global optimization algorithms, see e.g. [26]. One advantage of algorithms using a rectangular partition of the feasible region is their suitability for parallelization; however, the investigation of parallel versions of the algorithm proposed here is postponed to future research. For a discussion of the alternative simplicial partition of the feasible region in the development of global optimization algorithms we refer to [27].

The algorithm in [21] works for arbitrary domains, but the error bound, which required a certain quality triangulation, could only be proved for bivariate functions. In the present paper we are able to prove an asymptotic error bound for arbitrary dimension. Our main result is that eventually the error is at most

$$\begin{aligned} c_1(f,d) \exp \left( -\,c_2(f,d)\sqrt{N}\right) , \end{aligned}$$

where \(c_2(f,d)\) decreases exponentially in dimension d. Recall that we assume that f has continuous partial derivatives up to order 2, and that the global minimizer is unique and is in the interior of the domain.

In the worst-case setting any nonadaptive optimization algorithm has an error of order at least \(N^{-\,2/d}\). It is known that for a convex and symmetric function class such as \(C^2([0,1]^d)\) adaptive algorithms are essentially no more powerful than nonadaptive algorithms for linear problems such as function approximation with error measured using the \(L_\infty \) norm ([28]). Global optimization is as difficult as \(L_\infty \) approximation (see Theorem 18.24 in [22]). Therefore, \(N^{-\,2/d}\) is an optimal worst-case error bound for optimization on \(C^2([0,1]^d)\). To obtain an error of at most \(\epsilon \), we then require on the order of \((1/\epsilon )^{d/2}\) function values. For the algorithm proposed in this paper, we consider the asymptotic decay of the error, showing that for small enough \(\epsilon \) on the order of \(d^{d+3}\log (1/\epsilon )^2\) function values suffice. Because of the asymptotic setting, this does not contradict the worst-case results mentioned above.
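For a rough illustration of the two regimes (ignoring constant factors, and with the natural logarithm), take \(d=4\) and \(\epsilon =10^{-6}\): the worst-case count \((1/\epsilon )^{d/2}\) is of order \(10^{12}\), whereas \(d^{d+3}\log (1/\epsilon )^2=4^7\left( \log 10^6\right) ^2\approx 3\times 10^6\). For fixed d the asymptotic bound grows only polylogarithmically in \(1/\epsilon \), while for fixed \(\epsilon \) it still grows superexponentially in d.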

With the algorithm defined in this paper, the error converges to zero for a much broader class of functions than the one we consider. For example, the error converges to zero for any continuous function defined on \([0,1]^d\), though it could be difficult to characterize the rate of convergence for such a broad class.

At the present stage of investigation we have not focused on the details of efficient software development, and have tested a rather simple implementation (without sophisticated data structures) of the algorithm on well-known test functions. Our initial guess was that the Delaunay triangulation-based algorithm would be considerably more efficient than the algorithm proposed in this paper. This guess was based on the observation that hyper-rectangles have more volume far from the vertices than simplices. Also, in our Delaunay-based algorithm each step placed a single function evaluation in a simplex, while subdividing a hyper-rectangle takes a number of evaluations that grows exponentially with dimension. However, at least for the low-dimensional problems in our experiments, this guess was not confirmed. The results are presented in the last section of the paper, where the proposed algorithm is compared with the algorithm based on Delaunay triangulation [21].

2 The algorithm

The algorithm operates by decomposing \([0,1]^d\) into hyper-rectangles as follows. Given a current decomposition, choose one of the hyper-rectangles (according to the maximal value of a criterion to be defined below) and bisect it along the longest axis by evaluating the function at up to \(2^{d-1}\) midpoints of the longest hyper-rectangle edges (some of the midpoints may have been evaluated in previous decompositions).

The criterion for choosing a rectangle is rooted in the definition of the P-algorithm, described in detail in [29]. The P-algorithm chooses the point at which to compute the next objective function value according to the criterion of maximal improvement probability

$$\begin{aligned} \max _{x\in A}\mathrm{\mathbf P}\{\xi (x)\le y_{0N}-\varepsilon |\xi (x_i)=y_i,\,i=1,\ldots ,N\}, \end{aligned}$$

where \(\xi (x)\) is a Gaussian random field accepted as a statistical model of objective functions, the \(y_i\) are the values of the objective function computed at previous iterations, \(y_{0N}=\min _{i=1,\ldots ,N}y_i\), and \(\varepsilon >0\). The maximization of the improvement probability can be simplified because of the equality

$$\begin{aligned}&\arg \mathop {\max }\nolimits _{x\in A}\mathrm{\mathbf P}\{\xi (x)\le y_{0N}-\varepsilon |\xi (x_i)=y_i,\,i=1,\ldots ,N\} \nonumber \\&\quad =\arg \mathop {\max }\nolimits _{x\in A} \frac{y_{0N}-\varepsilon -\mu (x|(x_i,y_i),\,i=1,\ldots ,N)}{\sigma (x|(x_i,y_i),\,i=1,\ldots ,N)}, \end{aligned}$$
(1)

where \(\mu (x|(x_i,y_i),\,i=1,\ldots ,N)\) and \(\sigma ^2(x|(x_i,y_i),\,i=1,\ldots ,N)\) denote the conditional mean and variance, respectively. The criterion for the selection of a rectangle by the algorithm considered in the present paper is a modification of (1) in which the conditional mean and variance are replaced by the relevant parameters of the rectangle, and \(\varepsilon \) is replaced by a parameter \(g(\cdot )\) defined below; the form of its dependence on the iteration number is chosen with the aim of fast asymptotic convergence of the algorithm.

Suppose that the algorithm has evaluated f at N points. Let s(N) denote the number of hyper-rectangles in our partition. Each hyper-rectangle subdivision (iteration) requires up to \(2^{d-1}\) function evaluations; note that at the current iteration some function values may already be known, having been computed at previous subdivisions. Therefore, the following inequality holds:

$$\begin{aligned} s(N) \ge \frac{N}{2^{d-1}}. \end{aligned}$$
(2)

We next define some constants and functions that are used in the definition of the algorithm. Define

$$\begin{aligned} q\equiv \frac{3\cdot 2^{2/3} e^{-1}}{2\log (2)} \approx 1.27 ,\ \ \ \lambda \equiv (q d^2/4)^{d/2}, \end{aligned}$$

and set

$$\begin{aligned} g(x) \equiv \left( \lambda x \log (1/x)\right) ^{2/d}, \end{aligned}$$

for \(0<x\le 1/2\) and \(g(1)=q d^2/4\). Note that g is increasing and \(g(x)\downarrow 0\) as \(x\downarrow 0\).
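As a small illustration of these definitions, for \(d=2\) we have \(\lambda =(q\cdot 4/4)^{1}=q\) and \(g(x)=q\,x\log (1/x)\approx 1.27\,x\log (1/x)\) for \(0<x\le 1/2\).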

It will be convenient to index quantities to be defined by the iteration number of the algorithm, which we denote by n, instead of the number of function evaluations, N, which is generally larger. Later we will also describe the error bound in terms of the number of function evaluations N.

After n iterations of the algorithm, let \(M_n = \min _{1\le i\le n}f(x_i)\) and denote the error by \(\varDelta _n = M_n-f(x^*)\). Define the function \(L_n:[0,1]^d\rightarrow \mathbb {R}\) as follows. In the interior of hyper-rectangle R, \(L_n\) coincides with the multilinear interpolant of the function values at the vertices of R. On the boundaries of hyper-rectangles set \(L_n\equiv M_n\), so that \(M_n\) is equal to the global minimum of \(L_n\). Let \(v_n\) denote the smallest volume of a hyper-rectangle after n iterations.

Our algorithm will assign a numerical value to each hyper-rectangle in the subdivision. After n iterations of the algorithm, for each hyper-rectangle \(R_i\), \(1\le i\le {n}\), set

$$\begin{aligned} {\rho }^n_i\equiv \int _{R_i}\frac{ds}{\left( L_n(s)-M_n + g(v_n)\right) ^{d/2}}, \end{aligned}$$
(3)

and

$$\begin{aligned} \overline{\rho }^n_i\equiv \frac{\left| R_i\right| }{\left( \inf _{t\in int(R_i)}L_n(t)-M_n + g(v_n)\right) ^{d/2} }. \end{aligned}$$
(4)

The criteria (3) and (4) are analogues of the reciprocal of (1) with the opposite sign, where \(g(v_n)\) is a generalisation of \(\varepsilon \). Since the hyper-rectangle boundaries have measure zero, the way we defined \(L_n\) on the boundaries is immaterial for the values of the \(\rho ^n_i\). Note that the \(\inf \) in the denominator of (4) is equal to the minimum of f at the vertices of the hyper-rectangle.
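For illustration, criterion (3) can be approximated for a single hyper-rectangle by simple Monte Carlo sampling (a sketch with our own helper names; the two-dimensional implementation in Sect. 5 instead evaluates the inner integral analytically):

```cpp
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Illustrative Monte Carlo approximation of criterion (3) for one hyper-rectangle
// (our sketch; cf. the Rect-MC variant in Sect. 5). 'L' is the multilinear interpolant
// of the vertex values, lo/hi hold the bounds a_i, b_i, M is M_n and g_vn = g(v_n).
double rho_mc(const std::function<double(const std::vector<double>&)>& L,
              const std::vector<double>& lo, const std::vector<double>& hi,
              double M, double g_vn, int samples, unsigned seed = 0) {
    const int d = static_cast<int>(lo.size());
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double volume = 1.0;
    for (int i = 0; i < d; ++i) volume *= hi[i] - lo[i];
    std::vector<double> x(d);
    double sum = 0.0;
    for (int s = 0; s < samples; ++s) {
        for (int i = 0; i < d; ++i) x[i] = lo[i] + (hi[i] - lo[i]) * unif(rng);
        sum += 1.0 / std::pow(L(x) - M + g_vn, d / 2.0);   // integrand of (3)
    }
    return volume * sum / samples;   // |R_i| times the average of the integrand
}
```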

If we are about to subdivide the smallest hyper-rectangle, then

$$\begin{aligned} \rho _i^n \le {\overline{\rho }}^n_i \le \frac{\left| R_i\right| }{g(v_n)^{d/2}} = \frac{v_n}{\lambda v_n \log (1/v_n)} = \frac{1}{\lambda \log (1/v_n)} \le \frac{1}{\lambda \log (n)}, \end{aligned}$$
(5)

with the last inequality following from the fact that the smallest hyper-rectangle in the subdivision has volume \(v_n\le 1/n\).

A more formal description of the algorithm follows. In the description, N is the total number of function evaluations to make, i is the current number of hyper-rectangles, and j is the number of function evaluations that have been made. Let v, the volume of the smallest hyper-rectangle, take initial value 1 and let M, the minimum of the observed function values, take initial value the minimum of the function values at the vertices of \([0,1]^d\).

The algorithm comprises the following steps; a schematic sketch in C++ is given after the list.

  1. We start with the hyper-rectangle \([0,1]^d\) and with the function values at all \(2^d\) vertices. Set i, the number of hyper-rectangles, to 1 and set j, the number of function evaluations, to \(2^d\).

  2. For each hyper-rectangle \(R_k\) in the current collection \(\{R_1,R_2,\ldots ,R_i\}\), compute \({\rho }^{{i}}_k\), keeping track of the lowest index \(\gamma \) attaining the maximal value \({\rho }^i_\gamma \).

  3. From the hyper-rectangle \(R_\gamma \) with the maximal value \(\rho ^{i}_\gamma \), form two new hyper-rectangles as follows. Suppose that

    $$\begin{aligned} R_\gamma = [a_1,b_1]\times [a_2,b_2]\times \cdots \times [a_d,b_d], \end{aligned}$$

    and that m is the smallest index with \(b_m-a_m\ge b_i-a_i\) for \(1\le i\le d\). The two new hyper-rectangles are

    $$\begin{aligned} R^{\prime }_{\gamma } \equiv [a_1,b_1]\times \cdots \times [a_m,(a_m+b_m)/2]\times \cdots \times [a_d,b_d] \end{aligned}$$

    and

    $$\begin{aligned} R^{\prime \prime }_{\gamma } \equiv [a_1,b_1]\times \cdots \times [(a_m+b_m)/2, b_m]\times \cdots \times [a_d,b_d]. \end{aligned}$$

    Evaluate f at the (at most) \(2^{d-1}\) points

    $$\begin{aligned} \left( c_1,\ldots ,\frac{a_m+b_m}{2},\ldots ,c_d\right) ,\ \ \ c_i\in \{a_i,b_i\},\ 1\le i\le d, \ i\ne m, \end{aligned}$$

    which have not been evaluated previously.

    If \(\left| R_\gamma \right| =v\), then set \(v\leftarrow v/2\), and if the smallest of the new function values is less than the previously observed minimum M, then set M to that new smallest value. Update the number of hyper-rectangles \(i\leftarrow i+1\) and increment j by the number of new function evaluations.

  4. If \(j<{N}\), return to step 2.
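The following schematic C++ sketch summarises steps 1–4. It is our illustration, not the implementation used in Sect. 5: for brevity it ranks hyper-rectangles by the vertex-based surrogate criterion (4) instead of the integral (3), and it does not cache function values, so midpoints shared by neighbouring hyper-rectangles may be re-evaluated.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

using Point = std::vector<double>;
using Objective = std::function<double(const Point&)>;

struct Rect {
    Point lo, hi;                 // bounds a_i, b_i, i = 0..d-1
    std::vector<double> fv;       // f at the 2^d vertices, indexed by bit mask
    double volume() const {
        double v = 1.0;
        for (std::size_t i = 0; i < lo.size(); ++i) v *= hi[i] - lo[i];
        return v;
    }
    double fmin() const { return *std::min_element(fv.begin(), fv.end()); }
};

double optimize(const Objective& f, int d, int max_evals) {
    const double q = 3.0 * std::pow(2.0, 2.0 / 3.0) / (2.0 * std::exp(1.0) * std::log(2.0));
    const double lambda = std::pow(q * d * d / 4.0, d / 2.0);
    auto g = [&](double x) {      // the threshold g defined above
        return x >= 1.0 ? q * d * d / 4.0
                        : std::pow(lambda * x * std::log(1.0 / x), 2.0 / d);
    };
    // Step 1: the unit cube with all 2^d vertices evaluated.
    Rect root{Point(d, 0.0), Point(d, 1.0), {}};
    for (std::uint32_t m = 0; m < (1u << d); ++m) {
        Point x(d);
        for (int i = 0; i < d; ++i) x[i] = (m >> i & 1u) ? 1.0 : 0.0;
        root.fv.push_back(f(x));
    }
    std::vector<Rect> rects{root};
    int evals = 1 << d;
    double M = root.fmin(), vmin = 1.0;   // record value M and smallest volume v
    while (evals < max_evals) {
        // Step 2: choose the hyper-rectangle maximising the (surrogate) criterion.
        std::size_t best = 0;
        double best_rho = -1.0;
        for (std::size_t k = 0; k < rects.size(); ++k) {
            double rho = rects[k].volume() /
                         std::pow(rects[k].fmin() - M + g(vmin), d / 2.0);
            if (rho > best_rho) { best_rho = rho; best = k; }
        }
        // Step 3: bisect along the longest edge (smallest such coordinate index).
        Rect R = rects[best];
        int split = 0;
        for (int i = 1; i < d; ++i)
            if (R.hi[i] - R.lo[i] > R.hi[split] - R.lo[split]) split = i;
        double mid = 0.5 * (R.lo[split] + R.hi[split]);
        Rect left = R, right = R;
        left.hi[split] = mid;
        right.lo[split] = mid;
        for (std::uint32_t m = 0; m < (1u << d); ++m) {
            if ((m >> split & 1u) == 0u) continue;      // new midpoints carry bit 'split'
            Point x(d);
            for (int i = 0; i < d; ++i) x[i] = (m >> i & 1u) ? R.hi[i] : R.lo[i];
            x[split] = mid;
            double y = f(x);
            ++evals;
            left.fv[m] = y;                             // upper face of the left child
            right.fv[m ^ (1u << split)] = y;            // lower face of the right child
            M = std::min(M, y);
        }
        if (R.volume() == vmin) vmin *= 0.5;            // the smallest box was split
        rects[best] = left;
        rects.push_back(right);
    }
    return M;   // best function value found (M_n)
}
```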

Typical subdivisions for the two-dimensional case are depicted in Fig. 1 in Sect. 5. In the graphs, each vertex of the mesh corresponds to a point where the function has been evaluated. The function has been evaluated at the four vertices of each hyper-rectangle, as well as possibly at other points on its perimeter. The function \(L_n\) agrees on the interior of a hyper-rectangle with the multilinear interpolation of the values at the four vertices, disregarding additional evaluation points that may lie on the perimeter. The aspect ratio of each hyper-rectangle is either 1 or 2.

Let \(\hat{F}\subset C^2\left( [0,1]^d\right) \) denote the set of functions that have a unique global minimizer \(x^*\in ]0,1[^d\). Our error bound will hold for functions in \(\hat{F}\). It is essential for our result that the function has only a finite number of isolated global minimizers; the uniqueness assumption is to simplify the presentation.

Denote the matrix of second-order partial derivatives at the minimizer by \(D^2f(x^*)\), assumed positive definite.

The error bound will depend on the regularity of the function as measured by a seminorm that we now define. For a compact set \(K{\subset [0,1]^d}\) and \(f\in C^2(K)\), define the seminorm

$$\begin{aligned} \left\| D^2f\right\| _{\infty , K} \equiv \sup _{x\in K} \sup _{\begin{array}{c} u_1,u_2\in \mathbb {R}^d\\ \left\| u_i\right\| =1 \end{array} } \left| D_{u_1}D_{u_2}f(x)\right| , \end{aligned}$$

where \(D_yf\) is the directional derivative of f in the direction y. This is a measure of the maximum size of the second derivative of f over K.
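For instance, for a quadratic function \(f(x)=\frac{1}{2}x^TH x+b^Tx+c\) with symmetric \(H\), we have \(D_{u_1}D_{u_2}f(x)=u_1^TH u_2\) for all \(x\), so that \(\left\| D^2f\right\| _{\infty , K}\) equals the spectral norm of \(H\) for any \(K\).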

Our main result on the convergence rate of the error for this algorithm follows. Let us note that for the convergence of the original version of the P-algorithm for continuous objective functions we refer to [29], where the improvement threshold \(\varepsilon \), analogous to \(g(\cdot )\) in (3), (4), can be any positive constant. Under such a weaker assumption, however, the convergence rate is not known.

Theorem 1

Suppose that \(f\in \hat{F}\). There is a number \(n_0(f)\) such that for \(n\ge n_0(f)\), the residual error after n iterations of the algorithm satisfies

$$\begin{aligned} \varDelta _n{(f)\equiv \min _{1\le i\le n}f(x_i)-f(x^*)} \le \frac{{4^{2/d}}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d) \exp \left( -\,\sqrt{n} \beta (f,d)\right) , \end{aligned}$$

where

$$\begin{aligned} \beta (f,d) = \left( \frac{2\sqrt{\pi }\left( \det (D^2f(x^*))\right) ^{1/2}}{9e d^{(d+3)/2}(\pi e q)^{d/2}} \exp \left( -\,\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2}. \end{aligned}$$

The number \(n_0(f)\) after which the bound applies will be the maximum of several other \(n_i(f)\)’s that will be defined in the proof of the theorem. We note here that \(n_0(f)\) depends (exponentially) on the value of the seminorm \(\left\| D^2f\right\| _{\infty , [0,1]^d}\).

Note that the limiting error decreases when the determinant of the second derivative at the minimizer increases; a larger second derivative allows the search effort to concentrate more around the minimizer. The convergence rate in terms of the number of function evaluations n is quite fast, but the term \(\beta (f,d)\) decreases exponentially fast as d increases: as \(d\rightarrow \infty \), \(\beta (f,d) = O\left( d^{-(d+3)/2}\right) \). Therefore, this algorithm does not escape the curse of dimension.

We focus on the information complexity; that is, the cost is the number of function evaluations. The worst-case combinatorial cost of the algorithm grows quadratically with the number of iterations.

3 Interpolation error bounds

We will use results about interpolation from [30].

Lemma 1

Consider a hyper-rectangle \(R=\{x\in [0,1]^d: a_i\le x_i\le b_i, 1\le i\le d\}\), and let L denote the multilinear function that interpolates f at the vertices of the hyper-rectangle. We have the error bound

$$\begin{aligned} \left| f(x)-L(x)\right| \le \frac{1}{2}\sum _{i=1}^d(x_i-a_i)(b_i-x_i) \left\| D^2f\right\| _{\infty , R}\\ = \frac{1}{2}\frac{\sum _{i=1}^d(x_i-a_i)(b_i-x_i)}{\left| R\right| ^{2/d}} \left\| D^2f\right\| _{\infty , R}\left| R\right| ^{2/d}, \end{aligned}$$

which implies the bound

$$\begin{aligned}&\max _{x\in R}\left| f(x)-L(x)\right| \le \frac{1}{8}\sum _{i=1}^d(b_i-a_i)^2 \left\| D^2f\right\| _{\infty , R}\\&\quad = \frac{1}{8}\frac{\sum _{i=1}^d(b_i-a_i)^2}{\left| R\right| ^{2/d}} \left\| D^2f\right\| _{\infty , R}\left| R\right| ^{2/d}. \end{aligned}$$

Proof

See [30]. \(\square \)

The quantity

$$\begin{aligned} \frac{\sum _{i=1}^d(b_i-a_i)^2}{\left| R\right| ^{2/d}} = \frac{\sum _{i=1}^d(b_i-a_i)^2}{\left( \prod _{i=1}^d(b_i-a_i)\right) ^{2/d}} \end{aligned}$$

is a measure of how “bad” the hyper-rectangle is from the point of view of interpolation error; that is, how far it is from a cube. Since by construction of the algorithm the widths are within a factor of 2 of each other, there is some number \(1\le k\le d\) and a positive number w such that k of the dimensions have width w and \(d-k\) have width 2w, and

$$\begin{aligned} \frac{\sum _{i=1}^d(b_i-a_i)^2}{\left( \prod _{i=1}^d(b_i-a_i)\right) ^{2/d}} =\frac{k w^{2} + (d-k)(2w)^2}{(2^{d-k}w^d)^{2/d}} =\frac{(4d-3k)}{2^{2(d-k)/d}} =\left( d-\frac{3}{4}k\right) 2^{2k/d}. \end{aligned}$$

For \(k=d\) we get the value d and for \(k=1\) we get the value \((d-3/4)2^{2/d}\). Writing \(x=k/d\) and maximizing

$$\begin{aligned} d(1-3x/4)2^{2x} \end{aligned}$$

over \(x\in ]0,1[\) gives a maximum value of

$$\begin{aligned} \frac{3\cdot 2^{2/3} e^{-1}}{2\log (2)} d\equiv q\cdot d \approx 1.27 d. \end{aligned}$$

Therefore, the quality metric ranges from d up to about 1.27d, and the second bound in Lemma 1 implies that

$$\begin{aligned} \max _{x\in R}\left| f(x)-L(x)\right| \le \frac{1}{8} q\cdot d \left\| D^2f\right\| _{\infty , R} \left| R\right| ^{2/d}. \end{aligned}$$
(6)

Also note that

$$\begin{aligned} d\le \frac{\sum _{i=1}^d(b_i-a_i)^2}{\left( \prod _{i=1}^d(b_i-a_i)\right) ^{2/d}} \le qd. \end{aligned}$$
(7)
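For instance, in dimension \(d=2\) the partition contains only squares (\(k=2\)), for which this quantity equals 2, and rectangles of aspect ratio 2 (\(k=1\)), for which it equals \((2-3/4)\cdot 2=2.5\); both values lie in the interval \([d, qd]\approx [2, 2.53]\) of (7).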

In addition to \(\left\| D^2f\right\| _{\infty , [0,1]^d}\), we introduce a functional that is in a sense a measure of how difficult it is to obtain an \(\epsilon \)-approximation of the global minimum of f. For \(\epsilon >0\), define

$$\begin{aligned} {\mathcal I_f}(\epsilon )\equiv \int _{[0,1]^d} \frac{dx}{\left( f(x) - f^* +\epsilon \right) ^{d/2}}. \end{aligned}$$

The following lemma, proved in “Appendix A”, generalizes Lemma 3.2 of [21] to arbitrary dimension.

Lemma 2

$$\begin{aligned} \lim _{\epsilon \downarrow 0}\frac{{\mathcal I_f}(\epsilon )}{\log (1/\epsilon )} = \frac{d(2\pi )^{d/2}}{2\varGamma (1+d/2)} \cdot \left( \det (D^2f(x^*))\right) ^{-1/2} \equiv \alpha (f,d). \end{aligned}$$
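As an illustration of the constant, consider the model function \(f(x)=\Vert x-x^*\Vert ^2\) with \(x^*\) in the interior of the cube (a worked example, not part of the proof): then \(D^2f(x^*)=2I\), \(\det (D^2f(x^*))=2^d\), and \(\alpha (f,d)=d\pi ^{d/2}/\left( 2\varGamma (1+d/2)\right) \). In particular \(\alpha (f,2)=\pi \), which agrees with the direct computation of \({\mathcal I_f}(\epsilon )=\int _{[0,1]^2}\left( \Vert x-x^*\Vert ^2+\epsilon \right) ^{-1}dx\sim \pi \log (1/\epsilon )\) obtained by passing to polar coordinates around \(x^*\).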

4 Proof of Theorem 1

We begin with a rough outline of the steps of the proof of Theorem 1, which is based on a series of lemmas. The proofs of the lemmas appear in “Appendix B”.

Lemma 4 establishes that the hyper-rectangle that contains the global minimizer cannot be much larger than the smallest rectangle at that iteration; this fact allows us to bound the error in terms of \(v_n\), the volume of the smallest rectangle. Recalling that \(\rho ^n_i\) is the metric used to choose the next rectangle to split, Lemmas 5 and 6 establish that

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n\rho ^n_i \sim \frac{1}{\log (1/v_n)}, \end{aligned}$$

while Lemma 7 shows that

$$\begin{aligned} \sum _{i=1}^n\rho ^n_i \sim {\mathcal I}_f(g(v_n)). \end{aligned}$$

With this, Lemma 2 allows us to bound \(g(v_n)\), and thereby the error. All of the analysis relies on Lemma 3, which gives an upper bound for \(\rho ^n\).

We consider a fixed function \(f\in \hat{F}\). The subscript n will indicate the number of iterations of the algorithm (not the number of function evaluations). We assume that \(n\ge n_2(f)\), where \(n_2(f)\) is the first time after \(n_1(f)\) that the smallest rectangle is about to be split and

$$\begin{aligned} n_{{1}}(f) = \inf \left\{ n: n\ge \exp \left( \left( d\left\| D^2f\right\| _{\infty , [0,1]^d}\right) ^{d/2}\right) \right\} . \end{aligned}$$
(8)

(Later we will further restrict n.)

Lemma 3

For all \(n\ge n_2(f)\),

$$\begin{aligned} \rho ^n {\equiv \max \{\rho ^n_1,\rho ^n_2,\ldots ,\rho ^n_{n}\}}\le \frac{{4}}{\lambda \log (1/v_n)} \le \frac{{4}}{\lambda \log (n)}. \end{aligned}$$

Lemma 4

Let \(v_n^*\) denote the volume of the hyper-rectangle containing the minimizer \(x^*\). For \(n\ge n_2\),

$$\begin{aligned} v_n^* \le 4 v_n \end{aligned}$$
(9)

and

$$\begin{aligned} \varDelta _n \equiv M_n-f(x^*) \le 2 d^{-\,2} g(v_n). \end{aligned}$$
(10)

The following lemma gives a lower bound for \(\rho ^n\) that complements the upper bound provided by Lemma 3.

Lemma 5

For \(n\ge n_2(f)\),

$$\begin{aligned} \rho ^n \equiv \max _{i\le n}\rho ^n_i \ge \frac{\exp (-\,16/q)}{\lambda \log (1/v_n)}. \end{aligned}$$

Lemma 6

For \(n\ge n_3(f)\equiv 2n_2(f)\), we have the bounds

$$\begin{aligned} \frac{1}{{n}}\sum _{i=1}^{{n}} \rho ^n_i \le \rho ^n \le \frac{{4}}{\lambda \log (1/v_n)} \end{aligned}$$
(11)

and

$$\begin{aligned} \frac{1}{{n}}\sum _{i=1}^{ {n}}\rho ^n_i \ge \frac{1}{{2}}\exp \left( -\frac{1}{{2}}-\frac{16}{q}\right) \frac{1}{\lambda \log (1/v_n)}. \end{aligned}$$
(12)

The following lemma allows us to approximate the average of the \(\{\rho ^n_i\}\) in terms of \({\mathcal I}_f(\cdot )\), for which we have the approximation given by Lemma 2.

Lemma 7

For \(n\ge n_2(f)\),

$$\begin{aligned} {\frac{1}{2}} {\mathcal I_f}(g(v_n)) \le \sum _{i=1}^{{n}}\rho ^n_i \le {\frac{9}{4}}{\mathcal I_f}(g(v_n)). \end{aligned}$$
(13)

We are now ready to prove Theorem 1. From Lemma 6 we have the bounds

$$\begin{aligned} \frac{1}{{n}} \sum _{i=1}^{{n}}\rho ^n_i \le \frac{4}{\lambda \log (1/v_n)} \end{aligned}$$
(14)

and

$$\begin{aligned} \frac{1}{{n}}\sum _{i=1}^{{n}}\rho ^n_i \ge \frac{1}{{2}}\frac{\exp (-\,1/2-16/q)}{\lambda \log (1/v_n)}. \end{aligned}$$
(15)

Therefore

$$\begin{aligned} \frac{1}{8}\exp (-\,1/2-16/q) \frac{1/2}{{n}} {\mathcal I_f}(g(v_n)) \le \frac{1}{8}\exp (-\,1/2-16/q) \frac{1}{{n}} \sum _{i=1}^{{n}}\rho _i^n\\ \le \frac{1}{2}\exp (-\,1/2-16/q) \frac{1}{\lambda \log (1/v_n)} \le \frac{1}{{n}}\sum _{i=1}^{{n}}\rho _i^n \le \frac{{9/4}}{{n}} {\mathcal I_f}(g(v_n)), \end{aligned}$$

using the first inequality in (13), (14), (15), and the second inequality in (13), respectively. We focus on the inequalities

$$\begin{aligned}&\frac{1}{16}\exp (-\,{1}/2-16/q) \frac{1}{{n}} {\mathcal I_f}(g(v_n)) \le \frac{1}{4}\exp (-\,1/2-16/q) \frac{1}{\lambda \log (1/v_n)} \\&\quad \le \frac{{9/4}}{{n}} {\mathcal I_f}(g(v_n)). \end{aligned}$$

There is a number \(n_4(f)\) such that, by Lemma 2, \(n\ge n_4(f)\) implies that

$$\begin{aligned} \frac{1}{2}\log (1/g(v_n))\alpha (d) \le {\mathcal I_f}(g(v_n))\le 2\log (1/g(v_n))\alpha (d). \end{aligned}$$

Replace \({\mathcal I_f}(g(v_n))\) by \(\log (1/g(v_n))\) (times the appropriate factor) in the extreme terms of the above chain of inequalities to get

$$\begin{aligned}&\frac{1}{32}\exp (-\,{1}/{2}-16/q) \frac{1}{{n}} \log (1/g(v_n))\alpha (d)\\&\quad \le \frac{1}{4}\exp (-\,1/{2}-16/q) \frac{1}{\lambda \log (1/v_n)} \le \frac{{9/4}}{{n}} 2\log (1/g(v_n))\alpha (d). \end{aligned}$$

Now substitute the value

$$\begin{aligned} \log (1/g(v_n))= -\,\log (qd^2/4) +\frac{2}{d}\left( \log (1/v_n)-\log \log (1/v_n)\right) \end{aligned}$$

to obtain the bounds

$$\begin{aligned}&\frac{1}{32}\exp (-\,{1}/{2}-16/q) \frac{1}{{n}} \left( -\log (qd^2/4) +\frac{2}{d}\left( \log (1/v_n)-\log \log (1/v_n)\right) \right) \alpha (d)\\&\quad \le \frac{1}{4}\exp (-1/{2}-16/q) \frac{1}{\lambda \log (1/v_n)}\\&\quad \le \frac{{9/4}}{{n}} 2 \left( -\log (qd^2/4) +\frac{2}{d}\left( \log (1/v_n)-\log \log (1/v_n)\right) \right) \alpha (d). \end{aligned}$$

This implies that eventually, say for \(n\ge n_5(f)\),

$$\begin{aligned} \log (1/v_n) \ge \left( \frac{dn}{36\alpha (d)\lambda }\exp \left( -\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2}, \end{aligned}$$

which entails that

$$\begin{aligned} v_n \le \exp \left( - \left( \frac{dn}{36\alpha (d)\lambda }\exp \left( -\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2} \right) . \end{aligned}$$

Then

$$\begin{aligned} M_n-f^*\le & {} \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d) v_n^{2/d}\\\le & {} \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty ,[0,1]^d} (q \cdot d) \exp \left( - \left( \frac{dn}{36\alpha (d)\lambda }\exp \left( -\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2} \frac{2}{d} \right) . \end{aligned}$$

For \(n\ge n_0(f) \equiv \max \{ n_2(f), n_3(f), n_4(f), n_5(f)\}\),

$$\begin{aligned}&\varDelta _n \le \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d) \exp \left( - \left( \frac{dn}{36\alpha (d)\lambda }\exp \left( -\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2} \frac{2}{d} \right) \nonumber \\&\quad = \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d) \exp \left( - \sqrt{n}\left( \frac{d}{36\alpha (d)\lambda }\frac{4}{d^2} \exp \left( -\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2} \right) .\nonumber \\ \end{aligned}$$
(16)

Substituting the value of \(\alpha \) from Lemma 2, and using the bound (from Stirling’s approximation)

$$\begin{aligned} \varGamma (1+d/2) \ge \sqrt{2\pi }(d/2)^{(1+d)/2} e^{-(1+d/2)}, \end{aligned}$$

we obtain

$$\begin{aligned} \frac{d}{36\alpha (d)\lambda }\frac{4}{d^2}&= \frac{1}{9\alpha (d)\lambda d}= \frac{1}{9 \frac{d(2\pi )^{d/2}}{2\varGamma (1+d/2)} \cdot \left( \det (D^2f(x^*))\right) ^{-1/2} \lambda d}\\&= \frac{2\varGamma (1+d/2) \left( \det (D^2f(x^*))\right) ^{1/2} }{9 d(2\pi )^{d/2} (qd^2/4)^{d/2} d}\\&\ge \frac{2 \sqrt{2\pi }(d/2)^{(1+d)/2} e^{-\,(1+d/2)} \left( \det (D^2f(x^*))\right) ^{1/2} }{9 d(2\pi )^{d/2} (qd^2/4)^{d/2} d}\\&= \frac{2 \sqrt{\pi } \left( \det (D^2f(x^*))\right) ^{1/2} }{9 e d^{3/2} (e\pi q d)^{d/2} }. \end{aligned}$$

Combined with (16) this yields the bound

$$\begin{aligned} \varDelta _n\le & {} \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d)\\&\quad \times \,\exp \left( - \sqrt{n}\left( \frac{2 \sqrt{\pi } \left( \det (D^2f(x^*))\right) ^{1/2} }{9 e d^{3/2} (e\pi q d)^{d/2} } \exp \left( -\,\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2} \right) \\= & {} \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d) \exp \left( -\,\sqrt{n} \beta (f,d)\right) \end{aligned}$$

where

$$\begin{aligned} \beta (f,d) = \left( \frac{2\sqrt{\pi }\left( \det (D^2f(x^*))\right) ^{1/2}}{9 e d^{(d+3)/2}(\pi e q)^{d/2}} \exp \left( -\,\frac{1}{2}-\frac{16}{q}\right) \right) ^{1/2}. \end{aligned}$$

Finally, note that this is the expression for the error after n iterations of the algorithm, which corresponds to as many as \(2^{d-1}n\) function evaluations. After a total of N function evaluations, an upper bound on the error is given by

$$\begin{aligned} \frac{4^{2/d}}{8}\left\| D^2f\right\| _{\infty , [0,1]^d} (q \cdot d) \exp \left( -\,\sqrt{N} 2^{-(d-1)/2} \beta (f,d)\right) . \end{aligned}$$

This completes the proof of Theorem 1.

5 Numerical examples

The main goal of the research presented in this paper was the evaluation of the convergence rate of an algorithm that generalizes the previously investigated algorithms to the case of arbitrary d. However, we do not expect the upper bound to be tight for typical test functions, since it accounts for worst-case objective functions. Therefore, it seemed reasonable to include some results of preliminary numerical experiments to illustrate the relation between the asymptotic convergence rate and the practical performance. The theoretical overestimation of the actual error is commented on after the presentation of the experimental results.

A simple C++ implementation of the algorithm, applicable to problems of low dimensionality, was used. In the future we plan to implement a parallel version of the algorithm and to perform so-called competitive testing of the software according to the widely accepted methodology [13, 31, 32].

The algorithm consists of the following main parts: computation of the selection criterion (3), selection and subdivision of a hyper-rectangle, and storage of the newly produced hyper-rectangles. The computational burden of an iteration depends on the dimensionality d and on the iteration number. The dimensionality influences the computing time of the criterion (3). The amount of stored information grows with the iteration number, and so does the computer time needed for its management. Therefore, for problems of higher dimensionality, a parallel implementation of the algorithm is planned. Meanwhile, the numerical results below were obtained using the straightforward C++ implementation.

The computation of criterion (3) reduces to numerical integration over a hyper-rectangle. In the two-dimensional case the integral can be computed using a combination of analytic and numerical techniques. Let \(\mathbf p ^i=(p^i_0, p^i_1)\), \(i=0, \dots , 3,\) be the rectangle vertices, numbered so that \(p^0_0 = p^1_0 < p^2_0 = p^3_0\) and \(p^0_1=p^2_1 < p^1_1=p^3_1\). The corresponding objective function values are \(f_i = f(\mathbf p ^i)\), \(i=0, \dots , 3\). The bilinear interpolant of the objective function at an arbitrary point \(\mathbf x =(x_0, x_1)\), \(x_j \in [a_j, b_j], a_j=p^0_j, b_j=p^3_j, j =0, 1, \) can then be expressed as

$$\begin{aligned}&L(\mathbf x )=L(x_0, x_1)=A(x_1)x_0 + B(x_1),\nonumber \\&A(x_1)=\frac{\frac{x_1-a_1}{b_1-a_1}(f_0-f_1-f_2+f_3)+(f_2-f_0)}{b_0-a_0},\nonumber \\&B(x_1)=\frac{\frac{x_1-a_1}{b_1-a_1}(-f_0p^3_0 + f_1p^2_0 + f_2p^1_0 -f_3p^0_0) + (f_0p^3_0-f_2p^1_0)}{b_0-a_0}. \end{aligned}$$
(17)

Then

$$\begin{aligned}&\rho ^n_i=\int _{R_i}\frac{d\mathbf x }{L(\mathbf x ) - M_n + g(v_n)} = \int _{a_1}^{b_1}\int _{a_0}^{b_0} \frac{dx_0}{L(x_0, x_1) - M_n + g(v_n)} dx_1 \nonumber \\&\quad =\int _{a_1}^{b_1}\int _{a_0}^{b_0} \frac{dx_0}{A(x_1)x_0 + C(x_1)} dx_1 = \int _{a_1}^{b_1} u(x_1) dx_1, \end{aligned}$$
(18)

where

$$\begin{aligned} u(x_1)= & {} {\left\{ \begin{array}{ll} \frac{1}{A(x_1)}(\ln (A(x_1)b_0+C(x_1))-\ln (A(x_1)a_0+C(x_1))), &{}\quad \text {if}\; A(x_1)\ne 0,\\ \frac{1}{C(x_1)}(b_0-a_0),&{}\quad \text {if}\; A(x_1) = 0, \end{array}\right. }\nonumber \\ C(x_1)= & {} B(x_1)-M_n+g(v_n). \end{aligned}$$
(19)

Note that the function \(u(x_1)\) is continuous at the point \(\hat{x}_1=a_1+(b_1-a_1)\frac{f_0-f_2}{f_0-f_1-f_2+f_3}\) for which \(A(\hat{x}_1)=0\), provided this point falls in \([a_1, b_1]\).
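For illustration, the analytically evaluated inner integrand \(u(x_1)\) from (17)–(19) can be coded directly; the sketch below uses our own helper names and is not the exact implementation.

```cpp
#include <cmath>

// Sketch of the analytically evaluated inner integral u(x_1) from (17)-(19) for d = 2.
// f0..f3 are the function values at the rectangle vertices, numbered as in the text;
// a0,b0,a1,b1 are the rectangle bounds, M is M_n and g_vn = g(v_n).
struct InnerIntegrand {
    double f0, f1, f2, f3, a0, b0, a1, b1, M, g_vn;
    double operator()(double x1) const {
        double t = (x1 - a1) / (b1 - a1);
        double A = (t * (f0 - f1 - f2 + f3) + (f2 - f0)) / (b0 - a0);
        double B = (t * (-f0 * b0 + f1 * b0 + f2 * a0 - f3 * a0) + (f0 * b0 - f2 * a0)) / (b0 - a0);
        double C = B - M + g_vn;
        if (A != 0.0)   // a robust implementation would also guard against |A| near zero
            return (std::log(A * b0 + C) - std::log(A * a0 + C)) / A;
        return (b0 - a0) / C;
    }
};
// The criterion rho_i^n is then the one-dimensional integral of this functor over
// [a1, b1], computed e.g. with the GSL routine mentioned below.
```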

In the experiments for the two-dimensional case we compute \(\rho ^n_i\) as a one-dimensional integral using Eqs. (17)–(19). The one-dimensional integral is computed numerically using the GNU Scientific Library [33] routine gsl_integration_cquad, requiring a relative error smaller than \(10^{-7}\). We call the resulting implementation Rect-A. For comparison we also include integration over the rectangle \(R_i\) using the classical Monte Carlo method (GNU Scientific Library routine gsl_monte_plain_integrate) with \(M=10{,}000\) evaluations of the two-dimensional integrand (referred to as Rect-MC). Finally, we present a simplified version of computing \(\rho ^n_i\) for dimension \(d=2\) as

$$\begin{aligned} \rho ^n_i\approx \frac{|R_i|}{L(\mathbf x ) - M_n + g(v_n)},\quad x_j=(a_j+b_j)/2,\ j=0, \dots , d-1. \end{aligned}$$
(20)

We call this algorithm version Rect-1. In all implementations a database of function evaluations is employed, so that the objective function is never evaluated twice at the same location.

For comparison, the following two algorithms were used. A recent algorithm [21], based on the Delaunay triangulation of the optimization region and referred to as Simpl in this section, is also implemented in C++ and uses the 2D and 3D Delaunay triangulation code from the CGAL library [34]. The comparison with a non-adaptive algorithm is of interest especially because of the different convergence rates mentioned in the introduction. The non-adaptive algorithm is implemented using the so-called \(LP_{\tau }\) sequences [35], and is worst-case quasi-optimal since it generates the points at which objective function values are computed nearly uniformly in the feasible region. The results of that algorithm are referred to as Sobol.

We start by illustrating the partition of the feasible region produced by the algorithm Rect-A during minimization of the following objective functions:

  • The six-hump camel back function [36]:

    $$\begin{aligned} f(\mathbf x )&= \left( 4-2.1x_1^2 + x_1^4/3\right) x_1^2 + x_1x_2+ (-\,4+4x_2^2)x_2^2,\nonumber \\&\quad \mathbf x \in [-3, 3]\times [-\,2, 2]. \end{aligned}$$
    (21)
  • The Branin function [37]:

    $$\begin{aligned} f(\mathbf x )&= \left( x_2-\frac{5.1}{4\pi ^2}x_1^2+5\frac{x_1}{\pi }-6\right) ^2+ 10\left( 1- \frac{1}{8\pi }\right) \cos (x_1)+10,\nonumber \\&\quad \mathbf x \in [-\,5, 10]\times [0, 15]. \end{aligned}$$
    (22)

Figure 1 shows the partition after \(n=3000\) objective function evaluations. It can be seen that the whole space is roughly explored; however, the trial points are positioned more densely in the regions of potential global minima.

Fig. 1 Examples of the partition of the feasible region for the six-hump camel back (left) and Branin (right) test functions after \(n=3000\) function evaluations

Next, we present numerical experiments aimed at comparing the proposed algorithm with the algorithm of similar purpose that implements a partition based on the Delaunay triangulation of the feasible region [21]. The latter was tested using a set of 2-dimensional objective functions from [38] that have also been used in numerous other papers investigating Lipschitz optimization algorithms.

Table 1 presents the results of minimizing the test functions from [38], using the same numbering. The optimization error is presented (on a logarithmic scale) as \(\delta _n=-\log _{10}(M_n-f_{min})\) after \(n=1000\) and \(n=3000\) function evaluations for each of the algorithms Rect-A, Rect-1, Rect-MC, Simpl [21] and Sobol. The symbol \(\infty \) means that zero error was achieved within machine precision. The numbers in the left column are copied from the function definitions in [38], p. 468. The values for the algorithm of [21] are copied from Table 1 of that paper.

It can be seen that the various versions of the proposed algorithm perform comparably with the algorithm of [21]. Moreover, there appears to be no significant difference between the implementations Rect-A and Rect-MC, suggesting that the Monte Carlo technique could be used to approximate the integral in higher dimensions. As could be expected, the results of the non-adaptive Sobol algorithm are the worst. The minimization of function No. 13 was similarly problematic for all algorithms, as the global minimum of this function is hidden in a small basin behind a sharp spike.

Table 1 Testing results with functions from [38] defined over a 2-dimensional region

The minimization results for objective functions defined over 3-dimensional regions are shown in Table 2. The functions are also taken from [38]. Since an analytic expression for the integral is not available in the 3-dimensional case, we included only the versions Rect-1 and Rect-MC (\(M=100{,}000\) evaluations of the integrand per hyper-rectangle) of the proposed algorithm. The table shows that the proposed algorithm performs better than the other algorithms.

Table 2 Testing results with functions from [38] defined over a 3-dimensional region

The number of iterations needed to find an acceptable approximation of the minimum can depend on a coincidental match between properties of an algorithm and the objective function. To reduce this factor, the performance profiles below are evaluated using a sample of randomly generated problems in which the same objective function is minimized over a feasible region that is randomly (with uniform distribution) shifted by up to 10% of its original extent. The average and maximum errors over 100 random problems were recorded and plotted against the number of function evaluations n. A logarithmic scale is used for the graphs: \(\delta _n=-\,\log _{10}(M_n-f_{min})\), where \(M_n\) denotes the best function value found up to the \(n\mathrm{th}\) function evaluation and \(f_{min}\) is the actual global minimum value. Figure 2 presents the graphs for the (2-dimensional) six-hump camel back and Branin functions; the version Rect-A of the algorithm was used, and the corresponding partitions of the feasible region are shown in Fig. 1.

Fig. 2 The maximum and average error after n function evaluations of the algorithms Rect-A and Sobol on 100 perturbed cases of the 2-dimensional test problems: the six-hump camel back function (left) and the Branin function (right)

Fig. 3 The maximum and average error after n function evaluations of the algorithms Rect-1 and Sobol on 100 perturbed cases of the Rastrigin test problem (Eq. 23) for \(d=2, 3,4,5\)

To illustrate the influence of the dimension, similar experiments were performed with test problems of dimension up to 5. Let us note that contemporary publications on algorithms with a similar purpose, i.e. intended for expensive black-box problems, typically present experiments with objective functions of dimensionality up to five or six [16, 39]. The test problems were constructed by extending the two-dimensional Rastrigin function [15] to dimension d:

$$\begin{aligned} f(\mathbf x ) = \sum _{i=1}^{d}x_i^2-\cos (\pi x_1) - \cos (\pi x_2),\ \ \ \mathbf x \in [-\,2, 2]^d. \end{aligned}$$
(23)
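For reference, a direct encoding of (23) (an illustrative snippet of ours, not taken from the tested implementation) is:

```cpp
#include <cmath>
#include <vector>

// Illustrative encoding of the extended Rastrigin test function (23), x in [-2,2]^d, d >= 2.
double rastrigin_ext(const std::vector<double>& x) {
    const double pi = std::acos(-1.0);
    double s = 0.0;
    for (double xi : x) s += xi * xi;                      // sum of squares over all coordinates
    return s - std::cos(pi * x[0]) - std::cos(pi * x[1]);  // only the first two coordinates enter the cosines
}
```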

Such an extension keeps the number of local minima fixed at 5. The global minimum point is the d-dimensional vector with zero components. The results for \(d=2,3,4,5\) are presented in Fig. 3, and are also quantified in Table 3.

Now we can compare the upper bound given by Theorem 1 with the actual error estimated experimentally. Our analysis uses the inequality \(v_n\le 1/n\) (\(v_n\) is the volume of the smallest rectangle after n iterations), for example in the last inequalities in the proof of Lemma 4. If f is near its minimum over much of the domain, then \(v_n\approx 1/n\) (think of a nearly constant function with a tiny, locally quadratic valley in the center of the domain). For our test functions, \(v_n\) is much smaller than \(1/n\). We computed the characteristics of the test function (23), and arrived at a value of \(\beta = 2.05 \times 10^{-4}\) for dimension \(d=5\). For \(n=60{,}000\), this gives an upper bound of \(-\,0.02\) for \(\log _{10}\) of the error, while the corresponding number is less than \(-\,5\) in the last graph of Fig. 3 (the difference is even larger if we take into account the factor in front of the exponential term).

Table 3 The average/maximum number of objective function evaluations, performed by the algorithms Rect-1 and Sobol, before \(-\,\delta _n\) drops below t, for the 100 perturbed cases of the Rastrigin test problem (Eq. 23) in dimensions \(d=2,3,4,5\)