1 Introduction

Optimization algorithms aim to find the best solution from a set of candidate solutions for a problem with a constrained or unconstrained objective function. Many decision-making problems for real-world systems can be formulated as optimization problems over objective functions of a set of decision variables that model and describe such systems. These functions are often multi-variate, non-differentiable, multi-modal, and expensive to evaluate (see, e.g., [6, 17, 50]), which makes optimizing them exceptionally difficult. Optimization problems have been a recurring topic for centuries, and many numerical techniques for solving them have been proposed and widely discussed in the literature (see, e.g., [26, 41, 42]).

In some optimization problems, decision variables take values constrained to a given range. Such problems, often referred to as bound-constrained optimization problems, play a key role in the design of algorithms for general optimization problems, because many of these algorithms reduce the original problem to a sequence of bound-constrained subproblems. Moreover, bound-constrained optimization problems arise in several practical applications, as the decision variables of many real-world systems are bounded by physical limits [3, 48].

In all these algorithms, certain assumptions are made about the objective function being optimized (e.g., its continuity or differentiability). However, these assumptions are not necessarily satisfied by real-world objective functions, and often they are impossible to verify. Sometimes, the function is only available through a black box: one provides a point, representing a candidate solution (a realization of the decision variables), to the black box, which, in return, evaluates the function at that point and gives the corresponding function value. In other words, the only source of information about the function is the set of evaluations performed through the black box. Optimizing such functions is often referred to as a black-box optimization problem. Usually, a function evaluation requires computational resources, and hence black-box optimization seeks a trade-off between the quality of the solution found and the evaluation budget, namely the number of function evaluations (i.e., the amount of computational resources used). Moreover, most real-world objective functions are multi-modal, and one might be interested in finding a global or a local optimum. Global optimization algorithms are guaranteed to find the global optimum given enough computational resources, whereas local optimization algorithms are only guaranteed to find a local optimum, regardless of how many computational resources are used. Therefore, it is common to quantify the quality of the found solution as a function of the function evaluation budget.

Consider the bound-constrained black-box global optimization problem (BCBBGOP) stated as:Footnote 1

$$\begin{aligned} \begin{array}{ll} \text {maximize} &{}\quad f(x) \\ \text {subject to} &{}\quad x \in {\mathcal {X}}, \;{\mathcal {X}} = \{x \in {\mathbb {R}}^N: l\le x\le u\} \end{array} \end{aligned}$$
(1)

using a computational budget of n function evaluations, where \(f:{\mathcal {X}}\subset {\mathbb {R}}^N\rightarrow {\mathbb {R}}\) is a black-box function on which no information is available except its values at evaluated points. Let \(x^*\) be the (or one) solution to (1) such that \(f^* = f(x^*)\). An algorithm for finding a solution \(x^*\) is of one of two types: passive or sequential. A passive algorithm \({\mathcal {A}}_p\) uses its n-evaluation budget at once to evaluate a uniform grid of n points in the objective function’s domain \({\mathcal {X}}\) (also called the search space). \({\mathcal {A}}_p\) then returns x(n), the point with the best function value among the grid points, as a guess on \(x^*\) (see, e.g., [13]). Alternatively, a sequential algorithm \({\mathcal {A}}_s\) starts by arbitrarily guessing/selecting a point in the search space and evaluating f at it. \({\mathcal {A}}_s\) then iteratively refines its next guess based on the previously selected points and their corresponding function values. After n evaluations, \({\mathcal {A}}_s\) returns x(n), its best guess on \(x^*\) (see, e.g., [43]). Since there is no prior knowledge on f, there is no guarantee that x(n) of either \({\mathcal {A}}_p\) or \({\mathcal {A}}_s\) is the solution to the BCBBGOP of (1). While passive algorithms are simple and easy to implement, sequential algorithms can achieve solutions of better quality given the same computational budget. It was shown independently by Ivanov [28] and Sukharev [63] that, for some functions, the evaluation budget needed to find a solution of a desired quality is the same for the best passive and the best sequential algorithm in the worst case, whereas for most other functions, the evaluation budget required by a passive algorithm is much greater than that of a sequential algorithm.
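To make the contrast concrete, the following minimal Python sketch (our illustration; the toy objective and the interval-halving refinement rule are assumptions, not algorithms from the cited literature) spends the same n-evaluation budget passively and sequentially:

import numpy as np

def f(x):
    # Illustrative black-box objective; neither algorithm sees its form.
    return -(x - 0.7) ** 2

def passive(n, l=0.0, u=1.0):
    # A_p: spend the whole n-evaluation budget at once on a uniform grid.
    grid = np.linspace(l, u, n)
    values = [f(x) for x in grid]
    return grid[int(np.argmax(values))]          # x(n): best grid point

def sequential(n, l=0.0, u=1.0):
    # A_s: a toy refinement rule; evaluate the midpoints of the current
    # interval's two halves and recurse into the better half.
    for _ in range(n // 2):                      # 2 evaluations per round
        m = (l + u) / 2.0
        left, right = (l + m) / 2.0, (m + u) / 2.0
        if f(left) >= f(right):
            u = m
        else:
            l = m
    return (l + u) / 2.0                         # x(n): best guess

print(passive(20), sequential(20))               # both approach x* = 0.7

For this (unimodal) toy objective, the sequential rule localizes the maximizer to an interval of width \(2^{-n/2}\), whereas the passive grid achieves accuracy of only \(\mathcal {O}(1/n)\).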

A sequential algorithm performs an n-round sequential decision-making process under uncertainty, where a future action (selecting a candidate solution) depends on past actions (previously selected points) and their rewards (observed function values). A principal question is: how can \({\mathcal {A}}_s\) identify, as quickly as possible, the most rewarding action to select next? Intuitively, \({\mathcal {A}}_s\) would explore the set of possible actions it can take to learn more about f, and as its knowledge improves, \({\mathcal {A}}_s\) should increasingly exploit its current knowledge by selecting what it believes to be the best action. Clearly, the trade-off between exploration and exploitation affects \({\mathcal {A}}_s\)’s returned solution. The quality of \({\mathcal {A}}_s\)’s returned solution x(n), as a function of the number of function evaluations n, is evaluated by the following regret (or loss) measure:

$$\begin{aligned} r(n)= f^* - f(x(n)) \end{aligned}$$
(2)

Such a measure also serves as an indicator of \({\mathcal {A}}_s\)’s convergence rate with respect to n, as it gauges how efficient \({\mathcal {A}}_s\) is in capturing enough information about f to produce a solution with a desired degree of accuracy.

The literature on sequential decision-making problems under uncertainty is rich and large, with applicability in various scientific fields (e.g., machine learning, planning, and control). Initial investigations of such problems date back to Thompson in 1933 [64] and Robbins in 1952 [49], where they were known as multi-armed bandit problems. Since then, much progress has been made in addressing the exploration vs. exploitation dilemma. Many strategies have been proposed, studied, and analyzed, such as \(\epsilon \)-greedy [2], softmax [5], Bayesian [57], upper-confidence-bound [2], and optimistic [33] strategies. With respect to BCBBGOPs, Piyavskii [43], Shubert [56], and Strongin [59, 60] are considered the pioneers in applying the framework of sequential decision making to mathematical optimization. Their work has become the basis on which a large body of research relies (see, e.g., [25, 36, 42]). A general discussion of sequential algorithms for global optimization has been presented by Archetti and Betrò [1].

In this paper, we address a class of sequential decision-making algorithms that are designed to solve a BCBBGOP by carrying out a hierarchical partitioning of f’s domain \({\mathcal {X}}\) in search for \(x^*\). An algorithm of this class adopts a divide-and-conquer tree search for \(x^*\) by partitioning the search space \({\mathcal {X}}\) in a hierarchical fashion as a part of its n-round sequential decision making. Iteratively, it carries out a dynamic expansion of a tree whose nodes represent subspaces of \({\mathcal {X}}\), with the root corresponding to the entire search space \({\mathcal {X}}\). This creates a set of partitions of \({\mathcal {X}}\) over multiple scales (nodes of a specific depth correspond to a partition of a specific scale). We refer to this as Multi-Scale Optimization (MSO). The motivation behind introducing the MSO framework is twofold: first, tree-search algorithms have been gaining popularity (e.g., in the game of Go [7]); and second, while there is a great deal of literature on tree-search algorithms in various fields (e.g., artificial agents control and planning [4]), where they have been extensively analyzed, minimal research has been carried out in the context of optimization. Several state-of-the-art hierarchical partitioning algorithms for BCBBGOPs have been proposed and analyzed independently (see, e.g., [27, 29, 38, 56]). Nonetheless, the theoretical analysis has been primarily concerned with the asymptotic (limit) behavior of these algorithms and their convergence to optimal points (see, e.g., [16, 52, 65]). Here, we study the finite-time behavior and convergence rate under the MSO framework.

We present a theoretical methodology to analyze MSO algorithms, based on three assumptions on the objective function structure, and the hierarchical partitioning. Such methodology quantifies the amount of exploration conducted by an MSO algorithm over a partition of a specific scale. Consequently, a bound on the regret (2) can be established, which reflects the finite-time behavior of the algorithm. Using these theoretical findings, we establish finite-time analysis for Lipschitzian optimization methods [56], DIviding RECTangles (DIRECT) [29], and Multi-level Coordinate Search (MCS) [27]. Furthermore, we integrate the analysis of Deterministic Optimistic Optimization (DOO) and Simultaneous Optimistic Optimization (SOO) in [38] under the MSO framework.

This paper is organized as follows. Basic notations and terminology are introduced in Sect. 2. In Sect. 3, a formal introduction of the MSO framework for solving BCBBGOPs is provided. Towards the end of that section, some well-known MSO algorithms are discussed. In Sect. 4, we propose a principled approach for analyzing MSO algorithms theoretically and demonstrate it on the algorithms discussed in Sect. 3. Section 5 outlines the theoretical contribution and complements it with empirical validation. Finally, Sect. 6 summarizes the conclusions from this study.

2 Some preliminaries

In this section, using some of the concepts from [9], we formally define the hierarchy of partitions on the search space \({\mathcal {X}}\), and the tree data structure that represents it, as employed by an MSO algorithm.

We denote by \(2^{\mathcal {X}}\) the set of all subsets (subspaces, or cells) of \({\mathcal {X}}\). The size/volume of a subset in \(2^{\mathcal {X}}\) is approximated by the function \(\sigma :2^{\mathcal {X}}\rightarrow {\mathbb {R}}^+\). Two elements \({\mathcal {X}}_i, {\mathcal {X}}_j\) of \(2^{\mathcal {X}}\) are said to be disjoint if and only if

$$\begin{aligned} {\mathcal {X}}_i \cap {\mathcal {X}}_j = \beta ({\mathcal {X}}_{i}) \cap \beta ({\mathcal {X}}_{j}) \end{aligned}$$
(3)

Here, \(\beta ({\mathcal {X}}_i)\) denotes the boundary of \({\mathcal {X}}_i\). A subset of \(2^{\mathcal {X}}\) is called a partial partition of \({\mathcal {X}}\) if its elements are disjoint and nonempty. The union of a partial partition’s elements is called its support. A partition of \({\mathcal {X}}\) is a partial partition whose support is \({\mathcal {X}}\). A set \(\mathcal {G} \subseteq 2^{{\mathcal {X}}}\) is a hierarchy on \({\mathcal {X}}\) if any two of its elements are either disjoint or nested, i.e.:

$$\begin{aligned} {\mathcal {X}}_i \cap {\mathcal {X}}_j \in \{\beta ({\mathcal {X}}_{i}) \cap \beta ({\mathcal {X}}_{j}), {\mathcal {X}}_i,{\mathcal {X}}_j\} \ \text {for any } {\mathcal {X}}_i,{\mathcal {X}}_j \in \mathcal {G} \end{aligned}$$
(4)

Let X and Y be two distinct elements of a hierarchy \(\mathcal {G}\). We say that Y is a child of X and X is the parent of Y if \(Y\subseteq X\) and, for any \(Z \in \mathcal {G}\) such that \(Y\subseteq Z\subseteq X\), we have \(Z=X\) or \(Z=Y\). In other words, X is the smallest superset of Y among \(\mathcal {G}\)’s elements. The partition factor \(K \in {\mathbb {Z}}^+\) of a hierarchy \(\mathcal {G}\) is the maximum number of children of a parent in \(\mathcal {G}\). An element \(L \in \mathcal {G}\) is called a leaf of \(\mathcal {G}\) if it has no child.

Now, a hierarchy of partitions on \({\mathcal {X}}\) is formally defined as:

Definition 1

A hierarchy \(\mathcal {G}\) on \({\mathcal {X}}\) is a hierarchy of partitions on \({\mathcal {X}}\) if the set of its leaves is a partition of \({\mathcal {X}}\) and \({\mathcal {X}}\in \mathcal {G}\).

The term multi-scale in MSO is derived from the fact that a hierarchy of partitions comprises a set of partitions at multiple scales \(h\in {\mathbb {Z}}^+_0\). Let \(\mathcal {G}\) be a hierarchy of partitions on \({\mathcal {X}}\) created by an MSO algorithm. A hierarchy of partitions \(\mathcal {G}\) may be represented by a tree structure \({\mathcal {T}}(\mathcal {G})\) whose nodes correspond to the elements of \(\mathcal {G}\), with the root representing the whole space \({\mathcal {X}}\), and whose edges link every node corresponding to a child in \(\mathcal {G}\) to its parent’s node. \({\mathcal {T}}\) is referred to as a tree on \({\mathcal {X}}\) if it represents a hierarchy of partitions on \({\mathcal {X}}\). It is possible to index a node in \({\mathcal {T}}\) (and subsequently a cell of \({\mathcal {X}}\)) by one or more integer attributes. For example, with a partition factor of K, a node can be indexed by its depth/scale h and an index i, where \(0\le i \le K^h-1\), as (h, i), which corresponds to a cell/subspace \({\mathcal {X}}_{h,i} \subset {\mathcal {X}}\) and possesses up to K children nodes \(\{(h+1,i_k)\}_{1\le k \le K}\) such that:

$$\begin{aligned} {\mathcal {X}}&= \cup _{i\in \{0,\ldots ,K^h-1\}} {\mathcal {X}}_{h,i}\end{aligned}$$
(5)
$$\begin{aligned} {\mathcal {X}}_{h,i} \cap {\mathcal {X}}_{h,j}&= \beta ({\mathcal {X}}_{h,i}) \cap \beta ({\mathcal {X}}_{h,j}),\quad i\ne j\end{aligned}$$
(6)
$$\begin{aligned} {\mathcal {X}}_{h,i}&= \cup _{1\le k \le K} {\mathcal {X}}_{h+1,i_k},\quad \forall h\in {\mathbb {Z}}^+_0 \end{aligned}$$
(7)

Nodes can be indexed and grouped by any of their attributes, such as depth or size. For example, \(S=\{i\in {\mathcal {T}}:\sigma (i)=s\}\) is the set of all nodes in the tree \({\mathcal {T}}\) of size s. Note that for any two elements \(i,j \in S\), \(i\cap j = \beta (i) \cap \beta (j)\). The set of leaves of \({\mathcal {T}}\) is denoted by \({\mathcal {L}}_{{\mathcal {T}}} \subseteq {\mathcal {T}}\) and its depth is denoted by \(\texttt {depth}({\mathcal {T}})\).
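The following minimal Python sketch (our illustration) shows one way to represent such a tree on \({\mathcal {X}}\); the hyperrectangular cells and the split-along-the-longest-coordinate rule are assumptions made for the example, since each concrete MSO algorithm defines its own partitioning rule:

class Node:
    """A node (h, i) of a tree on X, storing its cell as a hyperrectangle."""
    def __init__(self, h, i, lower, upper):
        self.h, self.i = h, i            # depth/scale and index within depth
        self.lower, self.upper = lower, upper
        self.children = []

    def size(self):
        # sigma: a rough size measure of the cell (its longest side here).
        return max(u - l for l, u in zip(self.lower, self.upper))

    def expand(self, K=3):
        # Split the cell into K children of equal length along its longest
        # coordinate (an illustrative rule; MSO algorithms differ here).
        j = max(range(len(self.lower)),
                key=lambda d: self.upper[d] - self.lower[d])
        step = (self.upper[j] - self.lower[j]) / K
        for k in range(K):
            lo, up = list(self.lower), list(self.upper)
            lo[j] = self.lower[j] + k * step
            up[j] = self.lower[j] + (k + 1) * step
            self.children.append(Node(self.h + 1, self.i * K + k, lo, up))
        return self.children

root = Node(0, 0, [0.0, 0.0], [1.0, 1.0])   # the root represents all of X
leaves = root.expand(K=3)                   # children (1, 0), (1, 1), (1, 2)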

3 Multi-scale optimization (MSO)

In this section, a general framework for sequential decision-making algorithms that dynamically expand a tree on the search space in search of the optimal solution is introduced. We refer to this framework as Multi-Scale Optimization (MSO), as these algorithms partition the search space over multiple scales. We formally define an MSO algorithm as follows.

Definition 2

An algorithm that constructs a hierarchy of partitions on the search space \({\mathcal {X}}\) whilst looking for f’s global optimizer is an MSO algorithm.

MSO algorithms differ only in their strategies for growing and using the tree to provide a good approximation of f’s global maximizer point. Hence, we introduce a generic procedure for MSO algorithms.

3.1 A generic procedure of MSO algorithms

An MSO algorithm can be regarded as a divide-and-conquer tree search algorithm. In an iterative manner, it evaluates and assesses a set of leaf nodes of its tree on \({\mathcal {X}}\), and selectively expands a subset of them.

Each node provides an approximate solution \(x_{h,i} \in {\mathcal {X}}_{h,i}\), which is referred to as (h, i)’s representative state (or sometimes base point). Let \({\mathcal {A}}\) be an MSO algorithm with up to J function evaluations per node, P evaluated nodes per iteration, Q subdivided/expanded nodes per iteration, and a partition factor K. Furthermore, let us denote \({\mathcal {A}}\)’s tree \({\mathcal {T}}\) after iteration \(t\ge 1\) by \({\mathcal {T}}_t\).

Remark 1

Node expansions make the tree \({\mathcal {T}}\) grow with time. We may therefore index \({\mathcal {T}}\) by t, denoting by \({\mathcal {T}}_{t+1}\) its form after a specific event with respect to its previous form \({\mathcal {T}}_{t}\). An event could be the expansion of one or several nodes, or a function evaluation (for which \({\mathcal {T}}_{t+1}= {\mathcal {T}}_{t}\)). Expanding Q nodes \(\{(h^q,i^q)\}_{1\le q \le Q}\subseteq {\mathcal {L}}_t\) at time t, where \((h^q,i^q)\) is the qth expanded node, results in:

  1. \({\mathcal {T}}_{t+1}= {\mathcal {T}}_t \cup \{(h^q+1,i^q_k)\}_{1\le q \le Q,1\le k \le K}\)

  2. \({\mathcal {L}}_{t+1}= {\mathcal {L}}_t\setminus \{(h^q,i^q)\}_{1\le q \le Q} \cup \{(h^q+1,i^q_k)\}_{1\le q \le Q,1\le k \le K}\)

  3. \(|{\mathcal {T}}_{t+1}|=|{\mathcal {T}}_t|+QK\)

  4. \(|{\mathcal {L}}_{t+1}|=|{\mathcal {L}}_t|+Q(K-1)\)

  5. \(\texttt {depth}({\mathcal {T}}_{t+1})= \max (\texttt {depth}({\mathcal {T}}_{t}),\ \max _{1\le q \le Q}\ h^q+1)\)

At iteration \(t+1\), two steps take place, namely evaluation and expansion, illustrated as follows.Footnote 2

  1. Leaf Node(s) Evaluation In order to find \(x_{h,i}\), \(J\ge 1\) function evaluations are performed within \({\mathcal {X}}_{h,i}\) as a part of the sequential decision-making process and out of the n-evaluation budget. These J function evaluations of a leaf node may be independent of its ancestors’ and may not happen at the same iteration. We refer to the process of evaluating f within \({\mathcal {X}}_{h,i}\) J times as evaluating the node (h, i). Leaf nodes with fewer than J function evaluations are referred to as under-evaluated nodes; otherwise, they are called evaluated nodes. The set of evaluated nodes up to iteration t (inclusive) is denoted by \(\mathcal {E}_{t}\subseteq {\mathcal {L}}_{{\mathcal {T}}_{t-1}}\). At each iteration, \({\mathcal {A}}\) selects \(P\ge 1\) under-evaluated nodes (if any) to be evaluated. Denote the set of selected-to-be-evaluated nodes by \(\mathcal {P}_{t+1}\mathop {=}\limits ^{\mathrm{def}}\cup _{h\in \{0,\ldots ,\texttt {depth}({\mathcal {T}}_t)\}} \mathcal {P}_{t+1,h} \subseteq {\mathcal {L}}_{{\mathcal {T}}_t} \setminus \mathcal {E}_{t}\) where:

    $$\begin{aligned} \mathcal {P}_{t+1,h} \mathop {=}\limits ^{\mathrm{def}}\{(h,i): 0 \le i \le K^h-1,\ (h,i)\ \text {is evaluable at}\ t+1\ \text {according to}\ {\mathcal {A}}\} \end{aligned}$$
    (8)

    and \(|\mathcal {P}_{t+1}|\le P\).

  2. Leaf Node(s) Expansion \({\mathcal {A}}\) inspects the evaluated leaf nodes (if any) and selects \(Q\ge 1\) among them to be split/partitioned. These selected nodes represent the sub-domain in which \({\mathcal {A}}\) thinks \(x^*\) potentially lies and hence where a finer search is favored. We refer to the process of splitting/partitioning a leaf node (h, i) into its K children as expanding the node (h, i). Denote the set of selected-to-be-expanded nodes by \({\mathcal {Q}}_{t+1}\mathop {=}\limits ^{\mathrm{def}}\cup _{h\in \{0,\ldots ,\texttt {depth}({\mathcal {T}}_t)\}} {\mathcal {Q}}_{t+1,h} \subseteq \mathcal {E}_{t+1}\subseteq {\mathcal {L}}_{{\mathcal {T}}_{t}}\) where:

    $$\begin{aligned} {\mathcal {Q}}_{t+1,h} \mathop {=}\limits ^{\mathrm{def}}\{(h,i): 0 \le i \le K^h-1,\ (h,i)\ \text {is expandable at}\ t+1\ \text {according to}\ {\mathcal {A}}\} \end{aligned}$$
    (9)

    and \(|{\mathcal {Q}}_{t+1}|\le Q\).

After n function evaluations, \({\mathcal {A}}\) returns x(n):

$$\begin{aligned} x(n)\in \arg \max _{x_{h,i}:(h,i)\in {\mathcal {T}}} f(x_{h,i}) \end{aligned}$$
(10)

as an approximation of f’s global maximizer point.Footnote 3 This procedure is summarized in Algorithm 1.

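Algorithm 1 is rendered as a figure in the original article. As a substitute illustration, the following minimal Python sketch (our own, reusing the Node class from the sketch in Sect. 2) instantiates the two steps with \(J=1\) centre-point evaluations and a placeholder, purely exploitative default b-value:

def mso(root, f, n, P=1, Q=1, K=3, b_value=None):
    # Generic MSO loop: each iteration evaluates up to P under-evaluated
    # leaves (J = 1, at the cell centre), then expands up to Q evaluated
    # leaves of maximal b-value.  The default b-value (the node's own
    # function value) is a purely exploitative placeholder.
    b_value = b_value or (lambda v: v.fval)
    leaves, evaluated, used = [root], set(), 0
    while used < n:
        for v in [v for v in leaves if v not in evaluated][:P]:
            v.x = [(lo + up) / 2 for lo, up in zip(v.lower, v.upper)]
            v.fval = f(v.x)                      # one evaluation per node
            evaluated.add(v)
            used += 1
            if used == n:
                break
        ranked = sorted([v for v in leaves if v in evaluated],
                        key=b_value, reverse=True)
        for v in ranked[:Q]:                     # expand best-b leaves
            leaves.remove(v)
            leaves.extend(v.expand(K))
    best = max(evaluated, key=lambda v: v.fval)
    return best.x                                # x(n), cf. (10)

Concrete MSO algorithms are obtained by swapping in their own b_value functions and selection policies, as discussed next.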

One can obtain different MSO algorithms based on the policies defining the evaluable set \(\mathcal {P}\) and the expandable set \({\mathcal {Q}}\). In fact, an MSO algorithm \({\mathcal {A}}\) seeks a balance between two components of search [46]: exploration and exploitation. Exploration (or global search) refers to the process of learning more about the search space. On the other hand, exploitation (or local search) is the process of acting optimally according to current knowledge. Given a finite computational budget, excessive exploration (e.g., an algorithm with a broad search tree) leads to slow convergence, whereas excessive exploitation (e.g., an algorithm with a deep search tree) leads to premature convergence to a local maximum; hence a trade-off must be made. Accordingly, \({\mathcal {A}}\) chooses \(\mathcal {P}\) and \({\mathcal {Q}}\) from the leaf nodes of its tree \({\mathcal {T}}\) that jointly possess good \({\mathcal {A}}\)-defined exploration and exploitation scores.

Let \(l:{\mathcal {L}}\rightarrow {\mathbb {R}}\) be the exploitation score function and \(g:{\mathcal {L}}\rightarrow {\mathbb {R}}\) be the exploration score function. The projection of a node (h, i) onto the exploitation axis is then denoted by \(l_{h,i}=l((h,i))\), whereas its projection onto the exploration axis is denoted by \(g_{h,i}=g((h,i))\).

Exploitation (Local) Score The function value at (h, i)’s base point, \(f(x_{h,i})\), is the best value of f within \({\mathcal {X}}_{h,i}\) according to \({\mathcal {A}}\)’s current belief, and hence a direct indicator of (h, i)’s exploitation (local) score. Therefore, \(l_{h,i}\) is taken as \(f(x_{h,i})\) or an approximation of it (if it is unavailable) with an absolute error less than or equal to \(\eta \ge 0\):Footnote 4

$$\begin{aligned} |l_{h,i}-f(x_{h,i})|\le \eta \end{aligned}$$
(11)

Exploration (Global) Score While \(l_{h,i}\) reflects \({\mathcal {A}}\)’s guess on the best value of f within \({\mathcal {X}}_{h,i}\), (h, i)’s exploration (global) score \(g_{h,i}\) is regarded as the likelihood of finding a better value than \(f(x_{h,i})\) within \({\mathcal {X}}_{h,i}\). This is correlated with the bulk of unexplored space in \({\mathcal {X}}_{h,i}\) and often quantified by a rough measure of its size, for example, \(g_{h,i}=\texttt {depth}({\mathcal {T}})-h\) or \(g_{h,i}=\sigma ({\mathcal {X}}_{h,i})\). Hence, one can argue that nodes of the same depth have the same exploration score or may differ by up to a certain limit \(\zeta \ge 0\):

$$\begin{aligned} |g_{h,i}-g_{h,j}|\le \zeta \ \ \ \forall h\in {\mathbb {Z}}^+_0,\ i, j \in \{0,\ldots ,K^h-1\},\quad \ i\ne j \end{aligned}$$
(12)

In the exploration-exploitation plane, each node (h, i) of \({\mathcal {L}}\) is represented by the point \((g_{h,i},l_{h,i})\). Let Y be the set of these points; in other words, \(Y\mathop {=}\limits ^{\mathrm{def}}\{(g(x),l(x)):x\in {\mathcal {L}}\}\). \({\mathcal {A}}\) defines a measure-of-optimality function \(b:{\mathbb {R}}^2\rightarrow {\mathbb {R}}\) such that \(\mathcal {P}\) and \({\mathcal {Q}}\) are chosen from the set of leaf nodes whose exploration-exploitation points form the potentially optimal set \(P(Y)\mathop {=}\limits ^{\mathrm{def}}\{y\in Y:\{x\in Y:b(x) > b(y), x\ne y\}=\emptyset \}\). In other words, P(Y) is the subset of Y whose points lie on the level set of b corresponding to the highest b-value among all points of Y. Generally, b is a weighted sum of l and g of the form:

$$\begin{aligned} b_x =b((g(x),l(x))) = l(x) + \lambda \cdot g(x), \ \ x \in {\mathcal {L}}, \ \lambda \ge 0 \end{aligned}$$
(13)

trading off between local and global search. It is important to note that the functions g, l, and b used for \(\mathcal {P}\) may not be the same as those used for \({\mathcal {Q}}\).
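As an illustration of this selection, the following sketch computes P(Y) for the linear b of (13); the (g, l) sample points are hypothetical and chosen to mirror Fig. 1b:

def potentially_optimal(Y, lam=1.0):
    # P(Y): the points of Y that no other point strictly beats in b-value.
    # With the linear b of (13), these are simply the maximizers of b.
    b = lambda p: p[1] + lam * p[0]              # p = (g, l)
    best = max(b(p) for p in Y)
    return [p for p in Y if b(p) == best]

Y = [(4, 0), (3, 1), (2, 0), (1, 3), (0, 2)]     # hypothetical (g, l) pairs
print(potentially_optimal(Y))                    # the level set b = 4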

Fig. 1 The process of computing exploration and exploitation scores for the set of leaf nodes \({\mathcal {L}} \subseteq {\mathcal {T}}\) (a) of an MSO algorithm \({\mathcal {A}}\) can be regarded as projecting them onto a 2-D Euclidean space \({\mathbb {R}}^2\) (b) via the mapping function \(s:{\mathcal {L}}\rightarrow {\mathbb {R}}^2\). Y represents \({\mathcal {L}}\)’s image under s and \(P(Y)\subseteq Y\) is the potentially optimal set. Here P(Y) lies on the level set of \(b(x)=l_x+\lambda \cdot g_x\) that corresponds to the value 4 (the greatest among Y’s). \(\mathcal {P}\) and \({\mathcal {Q}}\) are chosen from the set of leaf nodes whose image under s is P(Y)

We can visualize the process of computing these scores as projecting the leaf nodes of \({\mathcal {T}}\) (depicted in Fig. 1a) onto a 2-D Euclidean space \({\mathbb {R}}^2\), as illustrated in Fig. 1b, whose vertical and horizontal coordinate axes represent the domains of the exploitation and exploration scores, respectively. Figure 1b shows \(l_{h,i}\) (11) for the node (4, 6), \(g_{h,i}\) (12) for the nodes at depth 4, and P(Y) for \(\lambda =1\).

Remark 2

One can introduce heuristics to consider, compute, and compare the b-values only for a subset of \({\mathcal {L}}\) at a given iteration from which \(\mathcal {P}\) and \({\mathcal {Q}}\) are chosen rather than the whole set \({\mathcal {L}}\).

3.2 MSO Algorithms in the literature

In the literature, several established algorithms (such as Lipschitzian Optimization (LO), Dividing Rectangles (DIRECT), Multi-level Coordinate Search (MCS), Deterministic Optimistic Optimization (DOO), Simultaneous Optimistic Optimization (SOO), and their variants) satisfy the definition of MSO algorithms.

These algorithms can be grouped into two categories. One category requires knowledge about f’s smoothness, with the b-values being a weighted sum of the local and global scores; LO and DOO are examples of this category. On the other hand, DIRECT, MCS, and SOO still make an assumption about f’s smoothness, but knowledge about it may not be available. Nevertheless, these algorithms account for more than one possible setting of f’s smoothness by grouping nodes based on their global scores; local scores play a role in analyzing each group separately. We describe algorithms of these two categories in accordance with the generic procedure discussed in Sect. 3.1.

3.2.1 Lipschitzian optimization (LO)

In 1972, Piyavskii [43] and Shubert [56] independently proposed Lipschitzian optimization (LO). It has been extended to multi-dimensional problems in [41]. LO assumes that f satisfies the Lipschitz condition:

$$\begin{aligned} |f(x)-f(y)| \le L ||x-y||, \ \forall x,y \in {\mathcal {X}} \end{aligned}$$
(14)

where L, a positive real number, is the Lipschitz constant.Footnote 5 For a maximization problem, LO begins by evaluating the extreme points of the search space (e.g., the vertices of \(l\le x \le u\) for (1)) and constructs a piece-wise linear estimator \({\hat{f}}\) upper-bounding f by employing the Lipschitz condition (14). The maximum point of \({\hat{f}}\) represents the current estimator’s best guess on where \(x^*\) lies. Furthermore, LO uses this point to partition the search space into multiple subspaces, on which the process of constructing an upper bound is applied recursively. In essence, LO iteratively refines a piece-wise linear estimator \({\hat{f}}\) to guide its hierarchical partitioning of the search space towards \(x^*\), where the next partitioning (sample) point is the optimum according to \({\hat{f}}\).
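For concreteness, a minimal sketch of the one-dimensional Piyavskii-Shubert scheme follows (our illustration; the toy objective and the overestimated L are assumptions, and the multi-dimensional extensions of [41] are more involved):

def shubert_1d(f, l, u, L, n):
    # 1-D Piyavskii-Shubert: on each interval [x_i, x_{i+1}] the Lipschitz
    # condition (14) caps f by two cones meeting at height
    # (f_i + f_{i+1} + L*(x_{i+1} - x_i)) / 2; sample where that cap peaks.
    xs, fs = [l, u], [f(l), f(u)]
    for _ in range(n - 2):
        best_i, best_bound = 0, -float("inf")
        for i in range(len(xs) - 1):
            bound = (fs[i] + fs[i + 1] + L * (xs[i + 1] - xs[i])) / 2
            if bound > best_bound:
                best_i, best_bound = i, bound
        i = best_i                               # peak of the estimator
        x_new = (xs[i] + xs[i + 1]) / 2 + (fs[i + 1] - fs[i]) / (2 * L)
        xs.insert(i + 1, x_new)
        fs.insert(i + 1, f(x_new))
    k = max(range(len(xs)), key=lambda j: fs[j])
    return xs[k], fs[k]                          # x(n) and its value

# Usage with a toy objective and an overestimate of its Lipschitz constant.
print(shubert_1d(lambda x: -abs(x - 0.3), 0.0, 1.0, L=2.0, n=15))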

With respect to the generic procedure outlined in Sect. 3.1, LO has the following settings: J is of \(\mathcal {O}(2^N)\); \(K=P=2^N\), that is, leaf nodes created in an iteration get evaluated in the next iteration; and \(Q=1\). The two steps of LO at iteration \(t+1\) are summarized as follows:

  1. Leaf Node(s) Evaluation \(\mathcal {P}_{t+1}\) is \({\mathcal {L}}_{{\mathcal {T}}_t}\setminus \mathcal {E}_{t}\), i.e., the set of leaf nodes that are not evaluated. For a node \((h,i)\in \mathcal {P}_{t+1}\), \(2^N\) function evaluations, at (h, i)’s vertices, are performed (hence \((h,i)\in \mathcal {E}_{t+1}\)), and the \(b\text {-value}\) (\({\hat{f}}(x_{h,i})\)) is computed as shown in Sect. 4.2.1.

  2. Leaf Node(s) Expansion \({\mathcal {Q}}_{t+1}\) is simply one node among those in \(\mathcal {E}_{t+1}\) whose b-value is the maximum, where ties are broken arbitrarily; that is, if more than one node attains the maximal b-value, any one of them is selected to be in \({\mathcal {Q}}_{t+1}\).

LO techniques are computationally complex, with a cost growing exponentially in the number of problem dimensions [25, 37]. Moreover, L is often unknown and overestimated, resulting in slow convergence. Such drawbacks limit the applicability of LO.

Variants of LO: Techniques from combinatorial optimization [34] can be employed in LO to construct lower and upper bounds on \(f(x_{h,i})\). Such bounds are exploited to eliminate a portion of the search space \({\mathcal {X}}\); in other words, some leaf nodes are never considered for expansion, and only a small part of the tree on \({\mathcal {X}}\) has to be generated and processed. This results in Branch-and-Bound (BB)-like search algorithms [12, 18, 25, 42]. LO techniques are only reliable if analytical knowledge about the function f is available or an approximation of L is used (see, e.g., [51]). For performance gains, LO can be parallelized, as demonstrated in [14, 55, 62]. A general discussion of parallel algorithms for global optimization has been presented in [61].

3.2.2 Deterministic optimistic optimization (DOO)

Munos [38] proposed DOO by assuming f to be locally smooth (around one of its global optima) with respect to a semi-metric \(\ell \). This assumption offers a relaxation of LO’s restrictive assumption. Based on the local smoothness, DOO estimates an upper bound on f within each cell. Similar to LO, this upper bound is used to guide the hierarchical partitioning of the space.

DOO constructs a tree on \({\mathcal {X}}\) whose settings, with respect to the generic procedure outlined in Sect. 3.1, are the following: \(J=1\); K and P are equal (as with LO) and treated as a parameter by DOO with a default value of 3; and \(Q=1\). The two steps of DOO at iteration \(t+1\) are summarized as follows:

  1. Leaf Node(s) Evaluation \(\mathcal {P}_{t+1}\) is \({\mathcal {L}}_{{\mathcal {T}}_t}\setminus \mathcal {E}_{t}\), i.e., the set of leaf nodes that are not evaluated. For a node \((h,i)\in \mathcal {P}_{t+1}\), one function evaluation at (h, i)’s centre is performed (hence (h, i) becomes an evaluated node), and the \(b\text {-value}\) is computed as shown in Sect. 4.2.2.

  2. Leaf Node(s) Expansion \({\mathcal {Q}}_{t+1}\) is simply one node among those in \(\mathcal {E}_{{t+1}}\) whose b-value is the maximum, where ties are broken arbitrarily.

DOO, similarly to LO, is often not applicable in practice, as it requires knowledge of f’s smoothness.
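Under the MSO view, DOO is obtained from the generic loop simply by choosing an optimistic b-value. A minimal illustration, reusing the mso and Node sketches above and assuming the decreasing bound \(\delta (h)=c\rho ^h\) of Assumption 2 (Sect. 4) is known:

# DOO as an instance of the generic loop: the b-value of an evaluated leaf
# (h, i) is the optimistic upper bound f(x_{h,i}) + delta(h), where
# delta(h) = c * rho**h stands in for the assumed-known smoothness bound.
c, rho = 1.0, 0.5
doo_b = lambda v: v.fval + c * rho ** v.h

root = Node(0, 0, [0.0, 0.0], [1.0, 1.0])
x_n = mso(root, f=lambda x: -sum((xi - 0.3) ** 2 for xi in x),
          n=100, P=3, Q=1, K=3, b_value=doo_b)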

3.2.3 Dividing rectangles (DIRECT)

LO’s limitations, namely its computational complexity and its need for the Lipschitz constant L, motivated Jones et al. to propose a search technique called DIRECT (DIviding RECTangles) [29]. First, DIRECT does not need knowledge of the Lipschitz constant L. Instead, it carries out the search by simultaneously considering all possible values of L from zero to infinity, thereby balancing global and local search and improving the convergence speed significantly. This is captured by the heuristic rule (17). In addition, it introduces an additional constraint on \({\mathcal {Q}}\) whose tightness depends on a parameter \(\epsilon \ge 0\). If \(f_{max}\) is the best current function value, then \(\epsilon |f_{max}|\) is the minimum amount by which the b-values of \({\mathcal {Q}}\) must exceed \(f_{max}\), as will be illustrated later. This \(\epsilon \)-constraint (16) helps protect the algorithm from excessive local search. Furthermore, DIRECT cuts the per-node evaluation cost from \(\mathcal {O}(2^N)\) to \(\mathcal {O}(1)\) by evaluating the cells’ centre points (which are their base points as well) instead of their vertices.

With respect to the generic procedure outlined in Sect. 3.1, \(J\le 1\),Footnote 6 \(K\le 3N\), \(P\le 3N\), and \(Q=\mathcal {O}(K^{\texttt {depth}({\mathcal {T}})})\). Furthermore, let \(\sigma \) be a measure of a node’s size such that a node (h, i) has a size \(\sigma _{h,i}=\sigma ({\mathcal {X}}_{h,i})=||d_{h,i}||_2\), where \(d_{h,i}\) is \({\mathcal {X}}_{h,i}\)’s diameter. Denote the set of evaluated-node sizes by \(S_t\mathop {=}\limits ^{\mathrm{def}}\{\sigma _{h,i}:(h,i)\in \mathcal {E}_{t} \}\); and denote the set of iteration indices \(\{t,\ldots ,t+|S_t|-1\}\) by \(I_t\), where \(I_1\) is the first iteration batch, \(I_{|S_t|+1}\) is the second iteration batch, and so on. With \(\acute{t}\in I_t\), \(\sigma ^{\acute{t}}_{S_t}\) is the \((\acute{t}-t+1)\)th element of \(S_t\) in descending order. Moreover, let \(f^t_{max}\) be the best function value achieved before t. The two steps of DIRECT at iteration \(\acute{t}\in I_t\) are then summarized as follows:

  1. Leaf Node(s) Evaluation \(\mathcal {P}_{\acute{t}}\) is \({\mathcal {L}}_{{\mathcal {T}}_{\acute{t}-1}}\setminus \mathcal {E}_{{\acute{t}-1}}\), i.e., the set of leaf nodes that are not evaluated. For a node \((h,i)\in \mathcal {P}_{\acute{t}}\), one function evaluation at (h, i)’s centre is performed (hence \((h,i)\in \mathcal {E}_{\acute{t}}\)), and the \(b\text {-value}\) is computed as shown in Sect. 4.2.3.

  2. Leaf Node(s) Expansion Let \(\mathbb {Q}^{I_t}_{\acute{t}}\) be the set of evaluated nodes whose size is \(\sigma ^{\acute{t}}_{S_t}\); mathematically, \(\mathbb {Q}^{I_t}_{\acute{t}}\mathop {=}\limits ^{\mathrm{def}}\{(h,i): (h,i)\in \mathcal {E}_{t}, \sigma _{h,i}=\sigma ^{\acute{t}}_{S_t}\}\), and let \(b_{\acute{t}}\) be \(\max _{(h,i)\in \mathbb {Q}^{I_t}_{\acute{t}}} b_{(h,i)}\). Then \({\mathcal {Q}}_{\acute{t}}\) is the set of nodes \((h,i)\in \mathbb {Q}^{I_t}_{\acute{t}}\) such that:

    $$\begin{aligned} b_{(h,i)} = b_{\acute{t}} \end{aligned}$$
    (15)

    and there exists \({\hat{L}}\ge 0\) such that:

    $$\begin{aligned} b_{(h,i)}+\hat{L} \sigma ^{\acute{t}}_{S_t} \ge (1-\epsilon )f^t_{max} \end{aligned}$$
    (16)
    $$\begin{aligned} b_{(h,i)}+\hat{L} \sigma ^{\acute{t}}_{S_t} \ge b_{\hat{t}} + \hat{L}\sigma ^{\hat{t}}_{S_t},\quad \forall \hat{t}\in I_t\setminus \acute{t} \end{aligned}$$
    (17)

    If no such node exists, the algorithm proceeds to the next iteration without expanding any node at \(\acute{t}\) (see [20] for more details on how (16) and (17) are tested).

One can notice that, within a batch of iterations, nodes are first contested against others of the same size, then against others of different sizes. A sketch of this potentially-optimal selection is given below.
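Conditions (15)-(17) can be tested without enumerating \(\hat{L}\): for each size group’s best node, (17) reduces to an admissible interval of slopes, and (16) is checked at that interval’s supremum. A minimal sketch reflecting our reading of these conditions (the toy numbers are hypothetical):

def direct_expandable(groups, f_max, eps=1e-4):
    # groups maps a cell size sigma to the best b-value among evaluated
    # nodes of that size (the per-size winners of (15)); returns the sizes
    # whose winner admits some L_hat >= 0 satisfying (16) and (17).
    sizes = sorted(groups)                       # ascending sigma
    chosen = []
    for i, s in enumerate(sizes):
        b = groups[s]
        # (17) vs. smaller sizes: L_hat >= (b_t - b) / (s - t), and >= 0.
        lo = max([(groups[t] - b) / (s - t) for t in sizes[:i]] + [0.0])
        # (17) vs. larger sizes:  L_hat <= (b - b_t) / (t - s).
        his = [(b - groups[t]) / (t - s) for t in sizes[i + 1:]]
        hi = min(his) if his else float("inf")
        if lo > hi:
            continue                             # no admissible L_hat
        # (16): its left-hand side grows with L_hat, so test the supremum.
        if hi == float("inf") or b + hi * s >= (1 - eps) * f_max:
            chosen.append(s)
    return chosen

print(direct_expandable({1.0: -0.5, 2.0: -0.3, 3.0: -0.4}, f_max=-0.3))
# -> [2.0, 3.0]: the size-1.0 winner is dominated for every L_hat >= 0.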

Variants of DIRECT Several interesting approaches have been proposed and inspired by DIRECT (see, e.g., [15, 19, 21, 31, 32, 35, 39, 40, 53, 54]). While DIRECT uses the \(L^2\)-norm to compute the global score g, Gablonsky et al. [21] used the \(L^\infty \)-norm, thereby reducing the variation in the global score and making the algorithm locally biased. Numerical experiments showed that, compared with the original DIRECT algorithm, it performs well on problems with few global minima. It has been shown in [15] that the \(\epsilon \)-constraint (16) is sensitive to additive scaling and leads to slow asymptotic convergence. Hence, in [15], a modification has been proposed that uses the threshold \(\epsilon |f_{max}-f_{median}|\) instead of \(\epsilon |f_{max}|\), where \(f_{median}\) is the median of the observed function values. Sergeyev et al. [53] coupled DIRECT’s idea of using several values of the Lipschitz constant L instead of a unique value with estimating a bound on f, and proposed a two-phase algorithm suitable for multi-modal functions. The first phase focuses on moving closer towards discovered local optima by expanding nodes whose cells are close to the current best solution. The second phase is oriented towards discovering new local optima by expanding nodes with high global scores that are far from the current best solution.

3.2.4 Multilevel coordinate search (MCS)

Huyer and Neumaier [27] addressed the slow-convergence shortcoming of DIRECT in optimizing functions whose optimizers lie at the boundary of \({\mathcal {X}}\), and devised a global optimization algorithm called Multilevel Coordinate Search (MCS). MCS partitions \({\mathcal {X}}\) into hyperrectangles of uneven sizes whose base points are not necessarily the centre points.

For a node (h, i) in its tree \({\mathcal {T}}\), MCS assigns a rank measure \(s_{h,i}\ge h\), which is used in selecting the expandable set \({\mathcal {Q}}\). The measure \(s_{h,i}\) captures how many times a node (h, i) has been part of, or a candidate for, an expansion process. We refer to this measure as pseudo-depth, because it does not reflect the actual depth of the node. The children of a node (h, i) with pseudo-depth \(s_{h,i}\) can have, upon creation, a pseudo-depth of \(s_{h,i}+1\) or \(\min (s_{h,i}+2,s_{max})\), based on their size with respect to their siblings. The expandable set \({\mathcal {Q}}\) is selected based on pseudo-depth.

A node (h, i) has a set of numbers \(\{n^{(h,i)}_j\}_{1\le j\le N}\), where \(n^{(h,i)}_j\) denotes the number of times \({\mathcal {X}}_{h,i}\) has been part of an expansion along coordinate j. \({\mathcal {T}}\)’s depth is controlled through a single parameter \(s_{max}\), which forces the tree to a maximal depth of \(s_{max}\). Given a fixed budget of node expansions, a greater \(s_{max}\) reduces the probability of expanding an optimal node in \({\mathcal {T}}\), and hence yields a greater regret bound.

Two heuristic rules are employed to post-process \({\mathcal {Q}}\). The first rule (19) is based on \(\{n^{(h,i)}_j\}_{1\le j\le N}\) and expands nodes that have high pseudo-depths, yet have at least one coordinate along which their corresponding hyperrectangles have not been part of an expansion very often. The second rule (20) expands a node along a coordinate where the maximal expected gain in function value is large enough; the gain \(\hat{e}^{(h,i)}_j\) for a node (h, i) along coordinate j is computed using a local quadratic model [27]. Accordingly, if \(\max _{1\le j \le N} \hat{e}^{(h,i)}_j\) is large enough, (h, i) is eligible for expansion along the coordinate \(\arg \max _{1\le j \le N} \hat{e}^{(h,i)}_j\). If neither of these rules holds for a node in \({\mathcal {Q}}\), it is removed from \({\mathcal {Q}}\) and its pseudo-depth is increased by one. Base points at depth \(s_{max}\) are put into a shopping basket, assuming them to be useful points. One can accelerate convergence by starting local searches from these points before putting them into the shopping basket.

With respect to the generic procedure outlined in Sect. 3.1, the settings of MCS are the following: \(J=1\); depending on the partitioning coordinate i, P equals \(L_i\ge 3\), where \(L_i\) is the number of sampled points along coordinate i; consequently, K can be \(2L_i\), \(2L_i-1\), or \(2L_i-2\); and \(Q=1\). Furthermore, let \(I_t\) be the set of iteration indices \(\{t,\ldots ,t+s_{max}-1\}\); \(I_1\) is the first iteration batch, \(I_{s_{max}+1}\) is the second iteration batch, and so on. The two steps of MCS at iteration \(\acute{t}\in I_t\) are summarized as follows:

  1. Leaf Node(s) Evaluation \(\mathcal {P}_{\acute{t}}\) is \({\mathcal {L}}_{{\mathcal {T}}_{\acute{t}-1}}\setminus \mathcal {E}_{{\acute{t}-1}}\), i.e., the set of leaf nodes that are not evaluated. For a node \((h,i)\in \mathcal {P}_{\acute{t}}\), one function evaluation is performed (hence \((h,i)\in \mathcal {E}_{\acute{t}}\)), and the \(b\text {-value}\) is computed as shown in Sect. 4.2.4.

  2. Leaf Node(s) Expansion Let \(\mathbb {Q}^{I_t}_{\acute{t}}\) be \(\{(h,i): (h,i)\in \mathcal {E}_{{\acute{t}}}, s_{h,i}=\acute{t}-t\}\) (if \(\mathbb {Q}^{I_t}_{\acute{t}}=\emptyset \), the current iteration is simply skipped to \(\acute{t}+1\)), and let \(b_{\acute{t}}\) be \(\max _{(h,i)\in \mathbb {Q}^{I_t}_{\acute{t}}} b_{(h,i)}\). Then \({\mathcal {Q}}_{\acute{t}}\) is simply the node \((h,i)\in \mathbb {Q}^{I_t}_{\acute{t}}\) such that:

    $$\begin{aligned} b_{(h,i)} = b_{\acute{t}} \end{aligned}$$
    (18)

    and that fulfills one of the two heuristic rules:

    $$\begin{aligned} \acute{t}-t > 2N\left( \min _{1\le j \le N} n^{(h,i)}_j+1\right) \end{aligned}$$
    (19)

    or:

    $$\begin{aligned} b_{(h,i)}\ge f_{max} - \max _{1\le j \le N} \hat{e}^{(h,i)}_{j} \end{aligned}$$
    (20)

    where ties are broken arbitrarily. If neither heuristic rule holds, the node’s pseudo-depth is set to \(s_{h,i}+1\) and the algorithm proceeds to the next iteration, where the node may be considered again. (A minimal sketch of this rule check follows.)
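A minimal sketch of this post-processing check (the argument names are ours; the node’s attributes are passed in explicitly):

def mcs_rules_hold(b, t_offset, n_split, e_gain, f_max, N):
    # Post-processing of Q in MCS: a maximal-b node stays expandable if
    # rule (19) or rule (20) holds; otherwise its pseudo-depth grows.
    #   b        : the node's b-value
    #   t_offset : the offset of the current iteration within its batch
    #   n_split  : [n_j]_{j=1..N}, splits of the cell along each coordinate
    #   e_gain   : [e_j]_{j=1..N}, expected gains (local quadratic model)
    rule_19 = t_offset > 2 * N * (min(n_split) + 1)      # (19)
    rule_20 = b >= f_max - max(e_gain)                   # (20)
    return rule_19 or rule_20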

3.2.5 Simultaneous optimistic optimization (SOO)

SOO [38] approximates DOO’s behavior when \(\ell \) is unknown. It simultaneously expands all the nodes (h, i) of its tree \({\mathcal {T}}\) for which there exists a semi-metric \(\ell \) such that the corresponding upper bound would be the greatest. This is simulated by expanding at most one leaf node per depth, if that node has the greatest \(f(x_{h,i})\) with respect to leaf nodes of the same or lower depths. In addition, the algorithm takes a function \(n\rightarrow h_{max}(n)\) as a parameter, which forces \({\mathcal {T}}\) to a maximal depth of \(h_{max}(n)+1\) after n node expansions (e.g., \(h_{max}(n)=n^\epsilon \) where \(\epsilon >0\)).

With respect to the generic procedure outlined in Sect. 3.1, the settings of SOO are the following: \(J=1\); K and P are equal and treated as a parameter by SOO with a default value of 3; and \(Q=1\). Furthermore, let \(I_t\) be the set of iteration indices \(\{t,\ldots ,t+h_{max}(n)\}\); \(I_1\) is the first iteration batch, \(I_{h_{max}(n)+2}\) is the second iteration batch, and so on. The two steps, according to [46], of SOO at iteration \(\acute{t}\in I_t\) are summarized as follows:

  1. Leaf Node(s) Evaluation \(\mathcal {P}_{\acute{t}}\) is \({\mathcal {L}}_{{\mathcal {T}}_{\acute{t}-1}}\setminus \mathcal {E}_{{\acute{t}-1}}\), i.e., the set of leaf nodes that are not evaluated. For a node \((h,i)\in \mathcal {P}_{\acute{t}}\), one function evaluation at (h, i)’s centre is performed (hence \((h,i)\in \mathcal {E}_{\acute{t}}\)), and the \(b\text {-value}\) is computed as shown in Sect. 4.2.5.

  2. Leaf Node(s) Expansion Let \(\mathbb {Q}^{I_t}_{\acute{t}}\) be \(\{(h,i): (h,i)\in \mathcal {E}_{{t-1}}, h=\acute{t}-t\}\), and let \(b_{\acute{t}}\) be \(\max _{(h,i)\in \mathbb {Q}^{I_t}_{\acute{t}}} b_{(h,i)}\). Then \({\mathcal {Q}}_{\acute{t}}\) is simply the node \((h,i)\in \mathbb {Q}^{I_t}_{\acute{t}}\) such that:

    $$\begin{aligned} b_{(h,i)} = b_{\acute{t}}\end{aligned}$$
    (21)
    $$\begin{aligned} b_{(h,i)}\ge b_{\hat{t}},\quad \forall \hat{t}:\ t\le \hat{t} <\acute{t} \end{aligned}$$
    (22)

    where ties are broken arbitrarily. If no such node exists, the current iteration is simply skipped to \(\acute{t}+1\). (A minimal sketch of this per-depth sweep follows.)
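A minimal sketch of the per-depth sweep (our paraphrase of (21)-(22); leaves_by_depth and f_of are assumed bookkeeping structures mapping depths to leaves and leaves to their evaluated values, respectively):

def soo_sweep(leaves_by_depth, f_of, h_max):
    # One SOO batch: sweep depths h = 0..h_max, expanding at most one leaf
    # per depth, namely the best one, and only if it is at least as good as
    # the best leaf already chosen at a shallower depth, cf. (21)-(22).
    to_expand, best_so_far = [], -float("inf")
    for h in range(h_max + 1):
        candidates = leaves_by_depth.get(h, [])
        if not candidates:
            continue                             # skip empty depths
        node = max(candidates, key=f_of)
        if f_of(node) >= best_so_far:
            to_expand.append(node)
            best_so_far = f_of(node)
    return to_expand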

Variants of SOO: Two main variants of SOO have been proposed in the literature, namely Stochastic Simultaneous Optimistic Optimization (StoSOO) [66] and Bayesian Multi-Scale Optimistic Optimization (BaMSOO) [67]. StoSOO addresses the situation where function evaluations are perturbed independently by noise. The b-values of StoSOO’s nodes are upper confidence bounds on f at their representative/base points. As a stochastic extension of SOO, it expands at most one node per depth after having evaluated it several times at its base point (i.e., \(J\ge 1\) function evaluations, all of them at \(x_{h,i}\)). Multiple evaluations per node ensure that the expanded nodes are, with high probability, close to optimal. On the other hand, BaMSOO was designed with the goal of cutting down the number of function evaluations incurred by SOO. BaMSOO eliminates the need to evaluate representative states \(x_{h,i}\) deemed unfit by Gaussian process (GP) posterior bounds. Prior to evaluating a node (h, i), upper and lower confidence bounds (UCB, LCB) on \(f(x_{h,i})\) are computed using a GP posterior fitted on the previous evaluations. These bounds are utilized to guide the search efficiently and possibly serve as the b-value of the corresponding node instead of \(f(x_{h,i})\).

3.3 Discussion on MSO algorithms

A brief discussion of the expandable sets \({\mathcal {Q}}\) and the rules of partitioning for different MSO algorithms is presented below.

The Expandable Set \({\mathcal {Q}}\) While the selected-to-be-evaluated set \(\mathcal {P}\) is almost the same for all the algorithms (a node is evaluated upon its creation), the process of selecting \({\mathcal {Q}}\) differs among them. Figure 2 shows a typical scenario of what \({\mathcal {Q}}\) could be in one iteration (or batch of iterations). Similar to Fig. 1b, it shows the evaluated leaves projected onto the exploration-exploitation plane. In LO and DOO, iterations are independent of each other, and a node is selected if it is the first node to lie on the curve from above. In DIRECT, MCS, and SOO, the case is different: iterations within a batch are co-dependent, and a node is selected if it is among the first nodes to lie on the corresponding curve from above. From their visualizations, it can be argued that SOO and DIRECT have a greedier behaviour than MCS, as they only expand a set of Pareto-optimal nodes in the exploration-exploitation plane.

Fig. 2 Selecting expandable leaf node(s) \({\mathcal {Q}}\) (represented by black dots) for an iteration in LO (a), a batch of iterations in DIRECT (b), a batch of iterations in MCS (c), an iteration in DOO (d), and a batch of iterations in SOO (e). The set Y, whose elements are represented by black and gray dots, is the set of evaluated leaves projected onto the exploration-exploitation plane

Rules of Partitioning (Expanding) a Node MSO algorithms use different procedures for expanding (partitioning) a node into its children. On one end, DOO and SOO, with a partition factor K, partition the cell/hyperrectangle of a node along a single coordinate, creating K cells of equal length along that coordinate; on the other end, LO partitions with respect to all N coordinates, which results in \(2^N\) (not necessarily equal-sized) cells. For MCS, knowledge built from sampled points is used to determine the partitioning coordinate as well as the position of the partition, through the rules of the golden section ratio and the expected gain [27]. DIRECT stands out in two aspects: its partitioning may go by a single coordinate up to N coordinates, in a way that creates nodes of good local and global scores; and these partitions do not all apply to the node of interest itself, but rather to the node and some of its descendants. In other words, if DIRECT is expanding a node (h, i) by N coordinates, the first split takes place at that node, and the \(m\)th split \((m\le N)\) is applied to a node of depth \(h+m-1\) which is part of the subtree rooted at (h, i) [29]. Consequently, the depth of DIRECT’s tree may increase by N when expanding a single node in \({\mathcal {Q}}\).

A number of rules for partitioning a node have been investigated and numerically validated [10, 30, 47]. For instance, numerical experiments in [30] showed that bisecting a node’s cell into two subcells works better than partitioning it into \(2^N\) subcells using N intersecting hyperplanes.

4 Convergence analysis of MSO algorithms

In this section, we propose a theoretical methodology for finite-time analysis of MSO algorithms. It is then applied to analyze different MSO algorithms in the literature.

4.1 Theoretical framework for convergence

We derive a measure of convergence rate by analysing the complexity of the regret r(n) (2). We upper-bound r(n) by quantifying the amount of exploration required in each of the multi-scale partitions to achieve a near-optimal solution. In line with [38], three basic assumptions are made; the first two assumptions assist in establishing a bound on r(n), whereas the third assumption helps in computing it in finite time.

4.1.1 Bounding the regret r(n)

To bound r(n) (2), one needs to make assumptions about the characteristics of f. In LO, f is assumed to be Lipschitz-continuous [43], whereas DOO and SOO assume local smoothness of f [38]. Here, we impose the local smoothness assumption on f, defined formally as local Hölder continuity. Let \({\mathcal {T}}\) be a tree on \({\mathcal {X}}\) created by an MSO algorithm and \(\ell :{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}^+\) be a semi-metric such that \(\ell (x,y)=L ||x-y||^{\alpha }\), with L and \(\alpha \) being positive real constants.

Assumption 1

(local Hölder continuity) f is Hölder-continuous with respect to (at least) one of its global optimizers \(x^*\); that is to say, \(\forall x \in {\mathcal {X}}\), it satisfies the Hölder condition [58] around \(x^*\):

$$\begin{aligned} |f^*-f(x)|\le L ||x^*-x||^{\alpha } = \ell (x^*,x) \end{aligned}$$
(23)

Remark 3

The class of Hölder-continuous functions is very broad. In fact, it has been shown in [20, 42] that among the Lipschitz-continuous functions (which are Hölder-continuous with \(\alpha =1\)) are convex/concave functions over a closed domain and continuously differentiable functions.
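For a concrete (illustrative) instance of our own, take \(f(x)=-\sqrt{|x|}\) on \({\mathcal {X}}=[-1,1]\), whose global maximizer is \(x^*=0\) with \(f^*=0\):

$$\begin{aligned} |f^*-f(x)| = \sqrt{|x|} = 1\cdot ||x^*-x||^{1/2} \end{aligned}$$

so Assumption 1 holds with \(L=1\) and \(\alpha =1/2\), even though f is not Lipschitz-continuous at \(x^*\).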

Before applying Assumption 1, we define the following.

Definition 3

A node \((h,i) \in {\mathcal {T}}\) is optimal if \(x^* \in {\mathcal {X}}_{h,i}\).

Definition 4

\(h^*_n \in {\mathbb {Z}}_0^+\) is the depth of the deepest optimal node that has been expanded up to n expansionsFootnote 7 and \((h^*_n,i^*_n) \in {\mathcal {T}}\) is the deepest expanded optimal node.

Given that \((h^*_n,i^*_n)\) is known, we can bound the regret as follows.

$$\begin{aligned} r(n)&= f^* - f(x(n)) \end{aligned}$$
(24)
$$\begin{aligned}&\le f^* -f \left( x_{h^*_n,i^*_n}\right) \end{aligned}$$
(25)
$$\begin{aligned}&\le \ell \left( x^*,x_{h^*_n,i^*_n}\right) \quad \quad \text {from (23)}\end{aligned}$$
(26)
$$\begin{aligned}&\le \sup _{x\in {\mathcal {X}}_{h^*_n,i^*_n}} \ell \left( x,x_{h^*_n,i^*_n}\right) \end{aligned}$$
(27)

Presume that the evaluation of a node’s children is always coupled with its expansion.Footnote 8 This means that \(f({x}_{h^*_n+1,i^*_{nk}})\), for \(1\le k \le K\), are known. Consequently, there exists a tighter bound on r(n) than (27), as \(x^*\) lies in one of \((h^*_n,i^*_n)\)’s children:

$$\begin{aligned} r(n) \le \sup _{ 1\le k \le K } \ \sup _{x\in {\mathcal {X}}_{h^*_n+1,i^*_{nk}}} \ell \left( x,x_{h^*_n+1,i^*_{nk}}\right) \le \sup _{x\in {\mathcal {X}}_{h^*_n,i^*_n}} \ell \left( x,x_{h^*_n,i^*_n}\right) \end{aligned}$$
(28)

In order for the bound on r(n) to be finite, the cells of \({\mathcal {T}}\)’s nodes need to be bounded. The next assumption ensures this:

Assumption 2

(Bounded Cells) There exists a decreasing sequence \(\delta (h)=c\rho ^h\) in h such that

$$\begin{aligned} \sup _{x,y\in {\mathcal {X}}_{h,i}}\ell (x,y)\le \delta (h) \quad \forall (h,i) \in {\mathcal {T}} \end{aligned}$$
(29)

where c is a positive real number and \(\rho \in (0,1)\).

Consequently, from (28) the regret is bounded by:

$$\begin{aligned} r(n) \le \delta (h^*_n+1) \end{aligned}$$
(30)

Thus, finding \(h^*_n\) is the key to bound the regret. This is discussed in the next section.
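As an illustrative numerical instance of (30): if Assumption 2 holds with \(c=1\) and \(\rho =1/2\), and the deepest expanded optimal node after n expansions is at depth \(h^*_n=5\), then

$$\begin{aligned} r(n) \le \delta (h^*_n+1) = c\rho ^{h^*_n+1} = 2^{-6} \approx 0.016 \end{aligned}$$

so every additional unit of optimal-node depth halves the regret bound.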

4.1.2 Finding the depth of the deepest expanded optimal node, \(h^*_n\)

MSO algorithms may inevitably expand non-optimal nodes as part of their exploration-vs.-exploitation strategy. Hence, the relationship between the number of expansions n and \(h^*_n\) may not be straightforward. Consider an MSO algorithm \({\mathcal {A}}\) whose tree’s deepest expanded optimal node after n expansions is at depth \(\acute{h}< \texttt {depth}({\mathcal {T}})\), i.e., \(h^*_n=\acute{h}\). For any node \((\acute{h}+1,i)\) to be expanded before \((\acute{h}+1,i^*)\), the optimal node at depth \(\acute{h}+1\), the following must hold for \(f(x_{\acute{h}+1,i})\):

$$\begin{aligned} b_{(\acute{h}+1,i)}&\ge b_{(\acute{h}+1,i^*)}\end{aligned}$$
(31)
$$\begin{aligned} l_{\acute{h}+1,i} + \lambda \cdot g_{\acute{h}+1,i}&\ge l_{\acute{h}+1,i^*} + \lambda \cdot g_{\acute{h}+1,i^*} \end{aligned}$$
(32)

from (11) and (12):

$$\begin{aligned} f(x_{\acute{h}+1,i}) + 2 \cdot \eta + \lambda \cdot \zeta&\ge f \left( x_{\acute{h}+1,i^*}\right) \end{aligned}$$
(33)

from (27) and (29):

$$\begin{aligned} f(x_{\acute{h}+1,i}) + 2 \cdot \eta + \lambda \cdot \zeta + \delta (\acute{h}+1)&\ge f^* \end{aligned}$$
(34)

Let us state three definitions to quantify and analyze nodes that satisfy (34).

Definition 5

The set \({\mathcal {X}}_{\epsilon } \mathop {=}\limits ^{\mathrm{def}}\{x \in {\mathcal {X}}: f(x) + \epsilon \ge f^* \}\) is the set of \(\epsilon \)-optimal states in \({\mathcal {X}}\).

Definition 6

The set \({\mathcal {I}}^\epsilon _h\mathop {=}\limits ^{\mathrm{def}}\{(h,i) \in {\mathcal {T}}: f(x_{h,i}) + \epsilon \ge f^*\}\) is the set of \(\epsilon \)-optimal nodes at depth h in \({\mathcal {T}}\). For instance, \({\mathcal {I}}^{2 \cdot \eta + \lambda \cdot \zeta + \delta (\acute{h}+1)}_{\acute{h}+1}\) is the set of nodes that satisfy (34).

Definition 7

\(h^\epsilon _n \in {\mathbb {Z}}^+_0\) is the depth of the deepest \(\epsilon \)-optimal node that has been expanded up to n expansions.

Let \(\varepsilon (h)=2 \cdot \eta + \lambda \cdot \zeta + \delta (h)\). If \({\mathcal {A}}\) allows expanding more than one node per depth, then we are certain that the optimal node at depth h gets expanded if all the nodes in \({\mathcal {I}}^{\varepsilon (h)}_h\) are expanded. Hence, \(h^*_n\) is guaranteed to be greater than or equal to h. Mathematically,

$$\begin{aligned} h^*_n \ge \max \left\{ h : h \in \{0,\ldots , \texttt {depth}({\mathcal {T}})-1\},\ x\text { is expanded } \forall x \in {\mathcal {I}}^{\varepsilon (h)}_h\right\} \end{aligned}$$
(35)

Therefore, the relationship between n and \(h^*_n\) can be established by finding out how many expansions n are required to expand, at a given depth h, all the nodes in \({\mathcal {I}}^{\varepsilon (h)}_h\). This number depends on two factors:

  1. The number of \(\varepsilon (h)\)-optimal nodes at depth h, \(|{\mathcal {I}}^{\varepsilon (h)}_h|\).

  2. \({\mathcal {A}}\)’s strategy in expanding \({\mathcal {I}}^{\varepsilon (h)}_h\). If \({\mathcal {A}}\) has nodes at depth h, where \(\acute{h}+1< h <\texttt {depth}({\mathcal {T}})\), then such nodes are not necessarily \(\varepsilon (h)\)-optimal. In other words, only a portion of the n expansions is dedicated to \(\cup _{h<\texttt {depth}({\mathcal {T}})}{\mathcal {I}}^{\varepsilon (h)}_h\); how large a portion depends on \({\mathcal {A}}\)’s expansion strategy.
While the second factor is \({\mathcal {A}}\)-dependent, the first depends on f, \(\ell \), and how the nodes are shaped. The first factor is discussed in the next section. In Theorem 1 of Sect. 4.1.4, we shall demonstrate how these two factors play a role in identifying the regret bound for a class of MSO algorithms.

Remark 4

(Another bound for r(n)) The condition (34) gives us another bound on the regret:

$$\begin{aligned} r(n)&\le 2\cdot \eta + \lambda \cdot \zeta + \delta (h) \quad \forall h \in \{0,\ldots ,h^{\varepsilon (h)}_n\} \end{aligned}$$
(36)
$$\begin{aligned}&\le 2\cdot \eta + \lambda \cdot \zeta + \delta \left( h^{\varepsilon (h)}_n\right) \end{aligned}$$
(37)

Note that \(h^*_n \ge \hat{h}\) implies \(h^{\varepsilon (h)}_n\ge \hat{h}\).

4.1.3 Bounding \(|{\mathcal {I}}^{\varepsilon (h)}_h|\)

Consider the \(\epsilon \)-optimal states \({\mathcal {X}}_\epsilon \) in an N-dimensional space. Assume that \(\acute{\ell }(x^*,x)\le f(x^*)-f(x)\), where \(\acute{\ell }(x^*,x)=\acute{L}||x^*-x||^\beta \), with \(\acute{L}\) and \(\beta \) positive real constants.Footnote 9 From Definition 5, \(\acute{\ell }(x^*,x)\le \epsilon \); \({\mathcal {X}}_\epsilon \) is then an \(\acute{\ell }\)-hypersphere of radius \(\epsilon \) centered at \(x^*\). From Definition 6, the nodes of \({\mathcal {I}}^\epsilon _h\) have their base points \(x_{h,i}\) in \({\mathcal {X}}_{\epsilon }\). Since \(x_{h,i}\) can be anywhere in \({\mathcal {X}}_{h,i}\), covering \({\mathcal {X}}_{\epsilon }\) by \({\mathcal {I}}^\epsilon _h\) is regarded as a packing problem of \({\mathcal {I}}^\epsilon _h\) into \({\mathcal {X}}_\epsilon \). Note that, in general, the cells of \({\mathcal {I}}^\epsilon _h\) will not be spheres. A bound on \(|{\mathcal {I}}^{\epsilon }_h|\) is formulated as the ratio of \({\mathcal {X}}_\epsilon \)’s volume to the smallest volume among the cells of \({\mathcal {I}}^\epsilon _h\):

$$\begin{aligned} |{\mathcal {I}}^{\epsilon }_h|\le & {} \frac{\sigma ({\mathcal {X}}_{\epsilon })}{\min _{(h,i)\in {\mathcal {I}}^\epsilon _h}\sigma ({\mathcal {X}}_{h,i})}\end{aligned}$$
(38)
$$\begin{aligned}= & {} \mathcal {O}\bigg (\frac{\epsilon ^{\frac{N}{\beta }}}{\min _{(h,i)\in {\mathcal {I}}^\epsilon _h}\sigma ({\mathcal {X}}_{h,i})}\bigg ) \end{aligned}$$
(39)

To simplify the bound further, we make an assumption about the cells' shape in line with Assumption 2.

Assumption 3

[Cells Sphericity] There exists \(1\ge v > 0\) such that for any \(h\in {\mathbb {Z}}^+_0\), any cell \({\mathcal {X}}_{h,i}\) contains an \(\ell \)-hypersphere of radius \(v\delta (h)\).

Remark 5

The purpose of Assumption 3 is to fit an \(\ell \)-hypersphere of radius \(v\delta (h)\) within each cell of depth h, so that the size of \({\mathcal {I}}_h^{\delta (h)}\) can be bounded more accurately. This depends on the algorithm's hierarchical partitioning rather than on the problem, and it holds seamlessly if the algorithm does not skew the shape of the cells. Almost all MSO algorithms maintain a coordinate-wise partition that does not skew the shape of their cells; it is therefore a reasonable assumption.

Cells sphericity lets us have an explicit bound on the number of \(\delta (h)\)-optimal nodes at depth h, \(|{\mathcal {I}}^{\delta (h)}_h|\), i.e., the number of nodes whose cells cover the \(\acute{\ell }\)-hypersphere of radius \(\delta (h)\) at depth h. Subsequently, \(|{\mathcal {I}}^{\delta (h)}_h|\) is bounded as:

$$\begin{aligned} |{\mathcal {I}}^{\delta (h)}_h|= & {} \mathcal {O}\bigg (\frac{{\delta (h)}^{\frac{N}{\beta }}}{{\delta (h)}^{\frac{N}{\alpha }}}\bigg )\end{aligned}$$
(40)
$$\begin{aligned}= & {} \mathcal {O}\bigg ({{\rho }^{-h\big (N(\frac{1}{\alpha } -\frac{1}{\beta })\big )}}\bigg ) \end{aligned}$$
(41)

which comes in one of three cases:

  1. \(\alpha < \beta \): At a given depth h, \(|{\mathcal {I}}^{\delta (h)}_h|\) scales exponentially with N.

  2. \(\alpha =\beta \): At a given depth h, \(|{\mathcal {I}}^{\delta (h)}_h|\) is constant and independent of N.

  3. \(\alpha > \beta \): This is not possible as it violates Assumption 1. Nevertheless, it tells us that if \(\ell \) were chosen such that there is no guarantee that \(f(x^*)-f(x)\le \ell (x^*,x)\), then \({\mathcal {A}}\) may expand nodes other than the \(\delta (h)\)-optimal nodes at depth h and possibly converge to a local optimum. In such a case, r(n) is of \(\mathcal {O}(1)\).

Let \(d_v\) be \(N(\frac{1}{\alpha }-\frac{1}{\beta })\). From (41), \(d_v\) can be seen as a quantitative measure of how much exploration is needed to guarantee optimality; the three cases of (41) are illustrated numerically in the sketch after Fig. 3. For better understanding, Fig. 3 shows \({\mathcal {X}}_\epsilon \) for \(N=2\) with \(\alpha =\beta \); it can be packed with \(\bigg (\frac{\epsilon ^{\frac{1}{\beta }}}{{(v\epsilon )}^{\frac{1}{\alpha }}}\bigg )^2=v^{-\frac{2}{\alpha }}\) \(\ell \)-circles of radius \(v\epsilon \). In this case, \(d_v\) is 0.

Fig. 3 \({\mathcal {X}}_{\epsilon }\) in a 2-dimensional space is an \(\acute{\ell }\)-circle (here, \(\acute{\ell }\) and \(\ell \) are the \(l_2\) norms, with \(\alpha \) and \(\beta \) set to 1) centered at \(x^*\) with a radius of \(\epsilon \)
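To make the three cases concrete, the following Python sketch evaluates the bound (41) for illustrative values of \(\alpha \), \(\beta \), \(\rho \), and h; the function name and all constants are assumptions for illustration only, not quantities computed by any of the discussed algorithms.

```python
def nodes_bound(N, alpha, beta, rho, h):
    # bound (41): |I| = O(rho**(-h * d_v)) with d_v = N * (1/alpha - 1/beta)
    d_v = N * (1.0 / alpha - 1.0 / beta)
    return rho ** (-h * d_v)

# alpha < beta: the bound grows exponentially with N at a fixed depth
print([nodes_bound(N, alpha=1.0, beta=2.0, rho=0.5, h=4) for N in (1, 2, 4)])
# alpha == beta: d_v = 0 and the bound is constant, independent of N
print([nodes_bound(N, alpha=2.0, beta=2.0, rho=0.5, h=4) for N in (1, 2, 4)])
```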

In accordance with Assumptions 1, 2, and 3, the next definition generalizes \(d_v\) and refers to it as the near-optimality dimension:

Definition 8

The v-near-optimality dimension is the smallest \(d_v \ge 0\) such that there exist \(C > 0\) and \(c>0\) such that for any \(1\ge \epsilon > 0\) and \(1\ge v>0\), the maximal number of disjoint \(\ell \)-hyperspheres of radius \(vc\epsilon \) contained in \({\mathcal {X}}_{c\epsilon }\) is less than \(C\epsilon ^{-d_v}\).

The next lemma reformulates and generalizes the bound of (41) in light of the near-optimality dimension definition.

Lemma 1

\(|{\mathcal {I}}^{m\delta (h)}_h| \le C\rho ^{-hd_{v/m}}\), where \(m>0\).

Proof

From Assumption 3, a cell \({\mathcal {X}}_{h,i}\) contains an \(\ell \)-hypersphere of radius \(\frac{v}{m}\cdot m\delta (h)\). As a result, if \(|{\mathcal {I}}^{m\delta (h)}_h|\) exceeded \(C\rho ^{-hd_{v/m}}\), then there would be more than \(C\rho ^{-hd_{v/m}}\) disjoint \(\ell \)-hyperspheres of radius \(\frac{v}{m}\cdot m\delta (h)\) in \({\mathcal {X}}_{m\delta (h)}\), which contradicts the definition of \(d_{v/m}\). \(\square \)

Using the above lemma, the bound of (41) can be reformulated as \(|{\mathcal {I}}^{\delta (h)}_h| \le C\rho ^{-hd_v}\). Having constructed an explicit bound on \(|{\mathcal {I}}^{\delta (h)}_h|\), we now bound \(|{\mathcal {I}}^{\varepsilon (h)}_h|\), which is addressed in the next lemma.

Lemma 2

For \(h< h_\varepsilon \), we have \(|{\mathcal {I}}^{\varepsilon (h)}_h|\le C \rho ^{-hd_{v/2}}\), where

$$\begin{aligned} h_{\varepsilon }\mathop {=}\limits ^{\mathrm{def}}\min \{h:\delta (h)< 2\cdot \eta +\lambda \cdot \zeta \} \end{aligned}$$
(42)

and \(d_{v/2}\) is the v/2-near-optimality dimension.

Proof

Bounding \(|{\mathcal {I}}^{\varepsilon (h)}_h|\) in a similar way to \(|{\mathcal {I}}^{\delta (h)}_h|\) requires that a cell \({\mathcal {X}}_{h,i}\) contains an \(\ell \)-hypersphere of radius \(\acute{v}(\delta (h)+2\cdot \eta +\lambda \cdot \zeta )\) where \(1\ge \acute{v}>0\). From Assumption 3, we know that a cell \({\mathcal {X}}_{h,i}\) contains an \(\ell \)-hypersphere of radius \({v}\delta (h)\) where \(1\ge {v}>0\). Thus, for a cell \({\mathcal {X}}_{h,i}\) to contain an \(\ell \)-hypersphere of radius \(\acute{v}(\delta (h)+2\cdot \eta +\lambda \cdot \zeta )\), we need \(\acute{v}(\delta (h)+2\cdot \eta +\lambda \cdot \zeta )\le {v}\delta (h)\). With respect to the v/2-near-optimality dimension, i.e., \(\acute{v}=\frac{v}{2}\), this holds at any depth h where \(\delta (h)\ge 2\cdot \eta +\lambda \cdot \zeta \), because then \(\frac{v}{2}(\delta (h)+2\cdot \eta +\lambda \cdot \zeta )\le \frac{v}{2}\cdot 2\cdot \delta (h)=v\delta (h)\). Now, since \(\delta (h)\) is a decreasing sequence in h, this is valid for depths less than \(\min \{h:\delta (h)< 2\cdot \eta +\lambda \cdot \zeta \}\), denoted by \(h_\varepsilon \).

Up to this point, we know that \(|{\mathcal {I}}^{\varepsilon (h)}_h|\) for \(h< h_\varepsilon \) is upper-bounded by the maximal number of disjoint \(\ell \)-hyperspheres of radius \(\frac{{v}}{2}\cdot 2\cdot \delta (h)\) packed in \({\mathcal {X}}_{2\delta (h)}\). Consequently, from Definition 8 and similar to the proof of Lemma 1, we have:

$$\begin{aligned} \left| {\mathcal {I}}^{\varepsilon (h)}_h \right|&\le C\rho ^{-hd_{v/2}}, \ \ \forall h< h_\varepsilon \end{aligned}$$
(43)

where \(d_{v/2}\) is the v/2-near-optimality dimension. \(\square \)

4.1.4 A convergence theorem

In this section, we present a theorem on the convergence of a class of MSO algorithms that adopt an expansion strategy of minimizing their trees' depths; the theorem connects \(r(n)\), \(h^*_n\), \(h^{\varepsilon (h)}_n\), and \({\mathcal {I}}^{\varepsilon (h)}_h\). Afterwards, three examples are worked out to show the effect of some parameters on the complexity of r(n).

Theorem 1

Let the assumptions of local Hölder continuity (Assumption 1), bounded cells (Assumption 2), and cells sphericity (Assumption 3) hold and let \({\mathcal {A}}\) be an MSO algorithm with a partition factor of K whose local and global score functions satisfy (11) and (12), respectively. Furthermore, \({\mathcal {A}}\) expands for each \(\varepsilon (h)\)-optimal node at most \(M-1\) other nodes with an expansion strategy of minimizing its tree \({\mathcal {T}}\)’s depth \(\texttt {depth}({\mathcal {T}})\).Footnote 10 Then the regret of \({\mathcal {A}}\) after n expansions is bounded as:

$$\begin{aligned} r(n) \le \delta \left( \max (\acute{h}(n),\min (h(n),h_\varepsilon ))\right) + 2\cdot \eta \end{aligned}$$
(44)

where \(\acute{h}(n)\) is the smallest \(h\in {\mathbb {Z}}^+_0\), such that:

$$\begin{aligned} M\sum _{l=0}^{\acute{h}(n)}K^l\ge n \end{aligned}$$
(45)

and h(n) is the smallest \(h\in {\mathbb {Z}}^+_0\), such that:

$$\begin{aligned} CM\sum _{l=0}^{h(n)}\rho ^{-ld_{v/2}} \ge n \end{aligned}$$
(46)

and \(h_\varepsilon \) is the smallest \(h\in {\mathbb {Z}}^+_0\) that satisfies (42).

Proof

Consider any \(h(n)\le h_\varepsilon \). From the definition of h(n) (46) and Lemma 2, we have:

$$\begin{aligned} M\sum _{l=0}^{h(n)-1}\left| {\mathcal {I}}^{\varepsilon (l)}_l\right| \le CM\sum _{l=0}^{h(n)-1}\rho ^{-ld_{v/2}}< n \end{aligned}$$
(47)

Since for each \(\varepsilon (h)\)-optimal node at most \(M-1\) other nodes (i.e., at most M nodes in total) are expanded in a way that minimizes the tree's depth, we deduce from (47) that the depth of the deepest expanded optimal node satisfies \(h^*_n\ge h(n)-1\) and the depth of the deepest expanded \(\varepsilon (h)\)-optimal node satisfies \(h^{\varepsilon (h)}_n\ge h(n)\). On the other hand, for \(h(n) > h_\varepsilon \), there is no valid bound on \(|{\mathcal {I}}^{\varepsilon (l)}_l|\). Thus, \(h^*_n \ge \min (h_\varepsilon ,h(n))-1\) and \(h^{\varepsilon (h)}_n\ge \min (h_\varepsilon ,h(n))\). With \(h^*_n\) and \(h^{\varepsilon (h)}_n\) in hand, we have two bounds on r(n). Consider first the bound based on \(h^*_n\). Let \((h^*,i^*)\) be the child node of the deepest expanded optimal node that contains \(x^*\). We have:

$$\begin{aligned} r(n) \le f^*-f(x_{h^*,i^*}) \end{aligned}$$
(48)

Now, since \({\mathcal {A}}\) deals with the approximation l (11) of f's values at the nodes' base points, the regret bound, with respect to \({\mathcal {A}}\)'s solution l(x(n)), is expressed as:

$$\begin{aligned} r(n)\le & {} f^* - f(x(n)) + \eta \quad \text {since } |l(x(n))-f(x(n))|\le \eta \;,\end{aligned}$$
(49)
$$\begin{aligned}\le & {} f^* -f(x_{h^*,i^*}) + 2\cdot \eta \quad \text {since } l(x(n))\ge l(x_{h^*,i^*})\;. \end{aligned}$$
(50)

From Assumptions 1 and 2, and since \(h^*=h^*_n+1\ge \min (h_\varepsilon ,h(n))\), we have:

$$\begin{aligned} r(n)&\le \delta (\min (h_\varepsilon ,h(n))) + 2\cdot \eta \end{aligned}$$
(51)

On the other hand, consider the bound based on \(h^{\varepsilon (h)}_n\ge \min (h_\varepsilon ,h(n))\). From Remark 4, the regret is bounded as \(r(n)\le 2\cdot \eta + \lambda \cdot \zeta + \delta (\min (h_\varepsilon ,h(n)))\). Clearly, the bound (51) is tighter. However, it relies on \(h_\varepsilon \) and hence can be very pessimistic (e.g., when \(h_{\varepsilon }=0\)). We may achieve a better bound by utilizing the fact that \({\mathcal {A}}\)'s strategy is to minimize its tree's depth, and that for n expansions, there are at least \(\lfloor \frac{n}{M}\rfloor \) expanded \(\varepsilon (h)\)-optimal nodes. From the definition of \(\acute{h}(n)\) and \({\mathcal {A}}\)'s strategy, we deduce that \(h^*_n\ge \acute{h}(n)-1\) and \(h^{\varepsilon (h) }_n\ge \acute{h}(n)\). Thus:

$$\begin{aligned} r(n) \le \delta \left( \acute{h}(n)\right) + 2\cdot \eta \end{aligned}$$
(52)

Therefore, r(n) here has two bounds, viz. (51) and (52), from which we choose the tighter, as in (44). \(\square \)
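As a minimal numeric sketch of how (42), (45), and (46) combine in (44), the following Python function computes the bound assuming \(\delta (h)=c\rho ^h\) (in line with Assumption 2); the function name, signature, and all constants are illustrative assumptions, not quantities prescribed by the theorem.

```python
def regret_bound(n, K, M, C, c, rho, d_v2, eta, lam, zeta):
    # h_eps: smallest h with delta(h) < 2*eta + lam*zeta, eq. (42);
    # infinite when the threshold is zero (eta = zeta = 0)
    thresh = 2 * eta + lam * zeta
    h_eps = float('inf')
    if thresh > 0:
        h_eps = 0
        while c * rho ** h_eps >= thresh:
            h_eps += 1
    # h'(n): smallest h with M * sum_{l=0..h} K**l >= n, eq. (45)
    h_acute, total = 0, M
    while total < n:
        h_acute += 1
        total += M * K ** h_acute
    # h(n): smallest h with C*M * sum_{l=0..h} rho**(-l*d_v2) >= n, eq. (46)
    h_n, total = 0, C * M
    while total < n:
        h_n += 1
        total += C * M * rho ** (-h_n * d_v2)
    # bound (44), with delta(h) = c * rho**h
    h_star = max(h_acute, min(h_n, h_eps))
    return c * rho ** h_star + 2 * eta

# illustrative constants only
print(regret_bound(n=500, K=3, M=2, C=4.0, c=1.0, rho=0.5,
                   d_v2=0.5, eta=0.01, lam=1.0, zeta=0.0))
```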

It is important to note here that not all MSO algorithms aim to minimize their trees' depths. Nevertheless, the aim of the theorem is to stress that there are usually two possible approaches to obtain a regret bound: the first involves identifying the link between n and \(h^*_n\) (Sect. 4.1.2); the second is based on identifying the link between n and \(h^{\varepsilon (h)}_n\) (Remark 4). Furthermore, the theorem shows that even when the two approaches are infeasible, an MSO algorithm's expansion strategy may help in establishing a better bound. The following examples evaluate the regret bound for different settings of Theorem 1's parameters.

Example 1

(r(n) for \(d_{v/2}=0\), \(\acute{h}(n)<\min (h(n),h_\varepsilon )\)) From (46), we have:

$$\begin{aligned} CM\sum _{l=0}^{h(n)}\rho ^{-ld_{v/2}}\ge & {} n \end{aligned}$$
(53)
$$\begin{aligned} CM(h(n)+1)\ge & {} n \quad \quad \text {since } d_{v/2}=0 \end{aligned}$$
(54)
$$\begin{aligned} h(n)\ge & {} \frac{n}{CM}-1\; \end{aligned}$$
(55)

We have two cases with respect to \(h_\varepsilon \):

  1. \(\eta =0\) and \(\zeta =0\), for which \(h_\varepsilon =\infty \). Therefore, r(n) decays exponentially with n:

    $$\begin{aligned} r(n)\le c\rho ^{\frac{n}{CM}} \end{aligned}$$
    (56)

  2. \(\eta >0\) or \(\zeta >0\), for which \(h_\varepsilon =\acute{h}<\infty \). Therefore:

    $$\begin{aligned} r(n)\le c\rho ^{\min \left( \frac{n}{CM},\acute{h}\right) } + 2\cdot \eta \end{aligned}$$
    (57)

    Clearly, for \(1\le n\le CM\acute{h}\), the regret decays exponentially with n towards \(2\cdot \eta \). For \(n> CM\acute{h}\), the best bound on r(n) equals \(c\rho ^{\acute{h}}+2\cdot \eta \), as the numeric sketch below illustrates.
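A small numeric sketch of the two regimes of (57); the constants \(c=1\), \(\rho =0.5\), \(C=M=1\), \(\eta =0.05\), and \(h_\varepsilon =6\) are assumed for illustration only.

```python
c, rho, C, M, eta, h_eps = 1.0, 0.5, 1, 1, 0.05, 6
for n in (1, 3, 6, 12, 24):
    bound = c * rho ** min(n / (C * M), h_eps) + 2 * eta
    # decays toward 2*eta, then plateaus at c*rho**h_eps + 2*eta
    print(n, round(bound, 4))
```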

Example 2

(r(n) for \(d_{v/2}>0\), \(\acute{h}(n)<\min (h(n),h_\varepsilon )\)) From (46), we have:

$$\begin{aligned} CM\sum _{l=0}^{h(n)}\rho ^{-ld_{v/2}} \ge n \end{aligned}$$
(58)

From the sum of geometric series formula:

$$\begin{aligned} CM\left( \rho ^{-d_{v/2}(h(n)+1)}-1\right)\ge & {} n\left( \rho ^{-d_{v/2}}-1\right) \end{aligned}$$
(59)
$$\begin{aligned} \rho ^{-d_{v/2}h(n)}> & {} \frac{n}{CM}\left( 1-\rho ^{d_{v/2}}\right) \end{aligned}$$
(60)
$$\begin{aligned} \rho ^{h(n)}< & {} \left( \frac{CM}{1-\rho ^{d_{v/2}}}\right) ^{\frac{1}{d_{v/2}}} \cdot n^{-\frac{1}{d_{v/2}}} \end{aligned}$$
(61)

Hence, h(n) grows logarithmically with n. We have two cases with respect to \(h_\varepsilon \):

  1. \(\eta =0\) and \(\zeta =0\), for which \(h_\varepsilon =\infty \). Therefore, r(n) decays polynomially with n:

    $$\begin{aligned} r(n)< c \left( \frac{CM}{1-\rho ^{d_{v/2}}}\right) ^{\frac{1}{d_{v/2}}} \cdot n^{-\frac{1}{d_{v/2}}} \end{aligned}$$
    (62)

  2. \(\eta >0\) or \(\zeta >0\), for which \(h_\varepsilon =\acute{h} <\infty \). Therefore:

    $$\begin{aligned} r(n)\le c\cdot \max \left( \rho ^{\acute{h}},\left( \frac{CM}{1-\rho ^{d_{v/2}}} \right) ^{\frac{1}{d_{v/2}}} \cdot n^{-\frac{1}{d_{v/2}}}\right) + 2\cdot \eta \end{aligned}$$
    (63)

Example 3

(r(n) for \(\acute{h}(n)>\min (h(n),h_\varepsilon )\)) From (45), we have:

$$\begin{aligned} M\sum _{l=0}^{\acute{h}(n)}K^l\ge & {} n \end{aligned}$$
(64)
$$\begin{aligned} K^{\acute{h}(n)+1}\ge & {} \frac{n}{M} (K-1) + 1\end{aligned}$$
(65)
$$\begin{aligned} \acute{h}(n)\ge & {} \lceil \log _{K}(\frac{n}{M})+ \log _K(K-1) -1 \rceil \end{aligned}$$
(66)

\(\acute{h}(n)\) grows logarithmically with n, and r(n) is bounded as:

$$\begin{aligned} r(n)\le c\rho ^{\lceil \log _{K}(\frac{n}{M})+ \log _K(K-1) -1 \rceil }+ 2\cdot \eta \end{aligned}$$
(67)

Remark 6

(MSO vs. Uniform Sampling) It is interesting to note that for an N-dimensional function where \(|f^*-f(x)|=||x^*-x||^\beta \), a uniform grid of Kn samples exhibits a polynomially decaying regret of \(\mathcal {O}(n^{-\frac{\beta }{N}})\), whereas an MSO algorithm with a partition factor K and n expansions (hence a comparable evaluation budget) may have an exponentially decaying regret of \(\mathcal {O}(\rho ^{\frac{n}{CM}})\) (Example 1), irrespective of the problem dimension N.
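The gap highlighted by Remark 6 can be sensed numerically; the setting below (\(\beta =2\), \(N=10\), \(\rho =0.5\), \(C=M=1\)) is an assumed example, not a computed instance.

```python
for n in (10, 50, 100):
    grid = n ** (-2.0 / 10.0)   # uniform sampling: O(n**(-beta/N))
    mso = 0.5 ** n              # MSO under Example 1's conditions: O(rho**(n/(C*M)))
    print(f"n={n}: grid regret ~ {grid:.3f}, MSO regret bound ~ {mso:.3e}")
```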

4.2 Analysis of MSO algorithms

Using the theoretical framework established in the previous section, we analyze the finite-time behaviour of different MSO algorithms in the literature.

4.2.1 Analysis of LO

Several researchers have addressed the convergence analysis of LO techniques [24, 41]. Here, we present a complementary analysis under the framework of MSO. Let \(N=1\), which implies that \({\mathcal {X}}_{h,i}\) is an interval \([c_{h,i},d_{h,i}]\) where \(c_{h,i}\) and \(d_{h,i}\) are its endpoints. Furthermore, \(\ell (x,y)\) here is \(L|x-y|\). Using the Lipschitz condition (14), \({\hat{f}}(x_{h,i})\), and eventually \(b_{(h,i)}\), is computed as:

$$\begin{aligned} b_{(h,i)}={\hat{f}}(x_{h,i})=\frac{f(c_{h,i})+f(d_{h,i})}{2} + L \frac{d_{h,i}-c_{h,i}}{2} \end{aligned}$$
(68)

This can be made equivalent to the exploitation-exploration trade-off (13) with the local score \(l_{h,i}=\frac{f(c_{h,i})+f(d_{h,i})}{2}\), \(\lambda =L\), and the global score \(g_{h,i}= \frac{d_{h,i}-c_{h,i}}{2}\). From the Lipschitz condition (14) and Assumption 2, we have:

$$\begin{aligned} 0\le {\hat{f}}(x_{h,i})-f(x_{h,i}) \le L(d_{h,i}-c_{h,i}) \le \delta (h)\;. \end{aligned}$$
(69)
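As a minimal sketch, the b-value (68) of an interval cell can be computed as follows; the function name and signature are illustrative, not part of any LO implementation.

```python
def b_value_lo(f_c, f_d, c, d, L):
    """b-value (68) of the interval cell [c, d] in LO, given the endpoint
    values f_c = f(c), f_d = f(d) and the Lipschitz constant L."""
    local_score = 0.5 * (f_c + f_d)   # exploitation term l_{h,i}
    global_score = 0.5 * (d - c)      # exploration term g_{h,i}, weighted by lambda = L
    return local_score + L * global_score

print(b_value_lo(f_c=0.2, f_d=0.6, c=0.0, d=1.0, L=2.0))  # 0.4 + 2*0.5 = 1.4
```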

The next lemma shows that LO expands \(2\delta (h)\)-optimal nodes in search for the optimal node at depth h.

Lemma 3

In LO and at depth h, a node \((h,i)\) is expanded before the optimal node \((h^*,i^*)\), if:

$$\begin{aligned} f(x_{h,i}) + 2 \delta (h) \ge f^* \;. \end{aligned}$$
(70)

Proof

In LO, expanding more than one node per depth is possible; a node \((h,i)\) is expanded before the optimal node \((h^*,i^*)\) when \(f(x_{h,i})\) satisfies the following:

$$\begin{aligned} b_{(h,i)}\ge & {} b_{(h^*,i^*)} \end{aligned}$$
(71)
$$\begin{aligned} {{\hat{f}}}(x_{h,i})\ge & {} {\hat{f}}(x_{h^*,i^*}) \end{aligned}$$
(72)

From (69):

$$\begin{aligned} f(x_{h,i}) + \delta (h)&\ge f(x_{h^*,i^*}) \end{aligned}$$
(73)

From the Lipschitz condition (14), which is in line with Assumptions 1 and 2, we have \(f^* - f(x_{h^*,i^*})\le \sup _{x\in {\mathcal {X}}_{h^*,i^*}}\ell (x_{h^*,i^*},x)\le \delta (h)\); from which (70) follows. \(\square \)

Therefore, we are certain that \(h^*_n\ge h\) if all the \(2\delta (h)\)-optimal nodes at depth h are expanded. The following theorem establishes the regret bound for LO algorithms.

Theorem 2

(Convergence of LO) Let \({\mathcal {T}}\) be LO’s tree on \({\mathcal {X}}\) and h(n) be the smallest \(h\in {\mathbb {Z}}^+_0\), such that:

$$\begin{aligned} C\sum _{l=0}^{h(n)}\rho ^{-ld_{v/2}} \ge n \end{aligned}$$
(74)

where \(d_{v/2}\) is the v / 2-near-optimality dimension and n is the number of expansions. Then the regret of LO is bounded as \(r(n)\le 2\delta (h(n))\).

Proof

From Lemma 3, LO expands only nodes that are \(2\delta (h)\)-optimal for \(0\le h\le \texttt {depth}({\mathcal {T}})\). As a result, the depth of the deepest expanded \(2\delta (h)\)-optimal node \(h^{2\delta (h)}_n\) is \(\texttt {depth}({\mathcal {T}})-1\), and hence the regret is bounded in light of Remark 4 as follows:

$$\begin{aligned} r(n) \le 2\delta (\texttt {depth}({\mathcal {T}})-1) \end{aligned}$$
(75)

Clearly, the bound depends on \(\texttt {depth}({\mathcal {T}})\). Let us make this bound more explicit by finding the minimum \(\texttt {depth}({\mathcal {T}})\) with n expansions. In line with Lemma 1, we have \(|{\mathcal {I}}^{2\delta (h)}_h|\le C\rho ^{-hd_{v/2}}\). This, together with the definition of h(n), gives:

$$\begin{aligned} \sum _{l=0}^{h(n)-1} \left| {\mathcal {I}}^{2\delta (l)}_{l}\right| \le C\sum _{l=0}^{h(n)-1}\rho ^{-ld_{v/2}}< n \end{aligned}$$
(76)

which shows that \(h^{2\delta (h)}_n\ge h(n)\). Therefore, \(r(n)\le 2\delta (h(n))\). \(\square \)

A similar analysis can be applied when \(N>1\); there, \({\mathcal {X}}_{h,i}\) is a hyperrectangle and \(\delta (h)\) bounds a norm \(L||x-y||\) rather than an interval.

4.2.2 Analysis of DOO

With respect to the theoretical framework, [38] provides the analysis of DOO for \(\delta (h)< 1\). Here, we provide a generalized analysis (including \(\delta (h)\ge 1\)) by modifying [38, Theorem 1]. Let us start with the b-value of a node \((h,i)\):

$$\begin{aligned} b_{(h,i)}=f(x_{h,i}) + \delta (h) \end{aligned}$$
(77)

With reference to (13), \(l_{h,i}=f(x_{h,i})\), \(\lambda =1\), and \(g_{h,i}= \delta (h)\).
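A minimal sketch of DOO's expansion loop follows; the cell interface (attributes .x for the base point and .h for the depth) and the expand callback are assumed for illustration and are not part of any particular implementation.

```python
def doo(f, delta, expand, root, n):
    # leaves are (value, cell) pairs; f is evaluated once per base point
    leaves = [(f(root.x), root)]
    for _ in range(n):
        # expand the leaf with the largest b-value (77): f(x_{h,i}) + delta(h)
        best = max(leaves, key=lambda t: t[0] + delta(t[1].h))
        leaves.remove(best)
        leaves.extend((f(child.x), child) for child in expand(best[1]))
    return max(leaves, key=lambda t: t[0])  # best (value, cell) found
```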

The next lemma shows that DOO expands \(\delta (h)\)-optimal nodes in search for the optimal node at depth h.

Lemma 4

In DOO and at depth h, a node \((h,i)\) is expanded before the optimal node \((h^*,i^*)\), if:

$$\begin{aligned} f(x_{h,i}) + \delta (h) \ge f^* \;. \end{aligned}$$
(78)

Proof

In DOO, expanding more than one node per depth is possible; a node \((h,i)\) is expanded before the optimal node \((h^*,i^*)\) when \(f(x_{h,i})\) satisfies the following:

$$\begin{aligned} b_{(h,i)}\ge & {} b_{(h^*,i^*)}\end{aligned}$$
(79)
$$\begin{aligned} f(x_{h,i})\ge & {} f(x_{h^*,i^*}) \end{aligned}$$
(80)

From Assumptions 1 and 2, we have \(f^* - f(x_{h^*,i^*})\le \sup _{x\in {\mathcal {X}}_{h^*,i^*}}\ell (x_{h^*,i^*},x)\le \delta (h)\); from which (78) follows. \(\square \)

Therefore, we are certain that \(h^*_n\ge h\) if all the \(\delta (h)\)-optimal nodes at depth h are expanded. The following theorem establishes the regret bound for DOO.

Theorem 3

(Convergence of DOO) Let h(n) be the smallest \(h\in {\mathbb {Z}}^+_0\) such that

$$\begin{aligned} C\sum _{l=0}^{h(n)} \rho ^{-ld_v}\ge n \end{aligned}$$
(81)

where \(d_v\) is the v-near-optimality dimension and n is the number of expansions. Then the regret of DOO is bounded as \(r(n)\le \delta (h(n))\).

Proof

From Lemma 4, DOO expands only nodes that are \(\delta (h)\)-optimal for \(0\le h\le \texttt {depth}({\mathcal {T}})\). As a result, the depth of the deepest expanded \(\delta (h)\)-optimal node \(h^{\delta (h)}_n\) is \(\texttt {depth}({\mathcal {T}})-1\), and hence the regret is bounded in light of Remark 4 as follows:

$$\begin{aligned} r(n) \le \delta (\texttt {depth}({\mathcal {T}})-1) \end{aligned}$$
(82)

Clearly, the bound depends on \(\texttt {depth}({\mathcal {T}})\). Let us make this bound more explicit by finding the minimum \(\texttt {depth}({\mathcal {T}})\) with n expansions. In line with Lemma 1, we have \(|{\mathcal {I}}^{\delta (h)}_h|\le C\rho ^{-hd_{v}}\). This, together with the definition of h(n), gives:

$$\begin{aligned} \sum _{l=0}^{h(n)-1} \left| {\mathcal {I}}^{\delta (l)}_l\right| \le C\sum _{l=0}^{h(n)-1}\rho ^{-ld_{v}}< n \end{aligned}$$
(83)

which shows that \(h^{\delta (h)}_n\ge h(n)\). Therefore, \(r(n)\le \delta (h(n))\). \(\square \)

4.2.3 Analysis of DIRECT

The b-value of a node in DIRECT is its local score l, with \(l_{h,i}=f(x_{h,i})\), \(\lambda =0\), and \(g_{h,i}=\sigma _{h,i}=||d_{h,i}||_2\), where \(d_{h,i}\) is \({\mathcal {X}}_{h,i}\)'s diameter. As \(\lambda =0\), DIRECT may seem to be an exploitative MSO algorithm, which is not the case. The global score is employed in a heuristic (represented by (16) and (17)) for selecting \({\mathcal {Q}}\) within a batch of iterations, which balances the exploration-vs.-exploitation trade-off. Given that the optimal node at depth \(\acute{h}-1\) has been expanded, for any node \((\acute{h},i)\) to be expanded before the optimal node \((\acute{h},i^*)\), the following must hold (assuming (16) and (17) hold):

$$\begin{aligned} b_{(\acute{h},i)}\ge & {} b_{(\acute{h},i^*)} \end{aligned}$$
(84)
$$\begin{aligned} f(x_{\acute{h},i}) + \delta (\acute{h})\ge & {} f^* \end{aligned}$$
(85)

For such depths, DIRECT expands \(\delta (h)\)-optimal nodes at depth h if the heuristic rules hold. However, there is no guarantee that the heuristic rules hold [15]. In [15, Theorem 2], it has been shown that DIRECT may behave as an exhaustive grid search, expanding nodes based solely on their sizes. In Theorem 4, we provide a finite-timeFootnote 11 regret bound on DIRECT by exploiting [15]'s finding that within a batch of iterations \(I_t\), there exists at least one node \(\in \mathcal {E}_{t-1}\) that satisfies (16) and (17) and hence gets expanded within \(I_t\). Such a node is simply the node \((h_*,i_*)\in \mathcal {E}_{t-1}\) such that \(b_{(h_*,i_*)}=b_t\).

Theorem 4

(Convergence of DIRECT) Let us define h(n) as the smallest \(h \in {\mathbb {Z}}^+_0\) such that:

$$\begin{aligned} \sum \limits _{l=0}^{h(n)}3^{l} \ge n_I \end{aligned}$$
(86)

where \(n_I\) is the greatest positive integer number such that:

$$\begin{aligned} {n_I}(n_I-1)\le \frac{2}{N^2}(n-N) \end{aligned}$$
(87)

where n is the number of expansions. Then the regret of DIRECT with \(Q=1\) is bounded as:

$$\begin{aligned} r(n)\le \delta (h(n)) \end{aligned}$$
(88)

Proof

DIRECT expands at least one node per batch of iterations, and this node is among the leaf nodes of the largest size. Thus, as DIRECT has a partitioning factor of 3, given a number of batches \(n_I\) and from the definition of h(n), all the nodes (optimal and non-optimal) at depth \(h(n)-1\) are expanded. Hence, \(r(n)\le \delta (h(n))\).

Since we are interested in bounding the regret as a function of the number of expansions n, we need to find the maximum number of expansions for \(n_I\) iteration batches. In a given batch \(I_t\), the maximum number of expansions with \(Q=1\) is \(N\cdot \texttt {depth}({\mathcal {T}}_{t-1})\), with \(\texttt {depth}({\mathcal {T}})\) growing at most by N per iteration batch. Thus, with 3 iteration batches, for instance, we can have at most \(N+1\cdot N\cdot N+2 \cdot N\cdot N\) expansions. For m batches of iterations, there are at most \(N+N^2\sum _{i=0}^{m-1}i = N + \frac{1}{2}\cdot N^2 \cdot m (m-1)\) expansions. Therefore, for n expansions, the minimum possible number of completed iteration batches is the greatest positive integer \(n_I\) such that:

$$\begin{aligned} N+ \frac{1}{2}\cdot N^2\cdot n_I (n_I-1)\le n \end{aligned}$$

from which (87) follows. \(\square \)
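The quantities of Theorem 4 can be evaluated numerically; the sketch below solves (87) for \(n_I\) and then (86) for h(n), with assumed inputs. The quadratic initial guess and the function name are illustrative choices.

```python
import math

def direct_bound_depth(n, N):
    # greatest positive integer n_I with n_I*(n_I - 1) <= (2/N**2)*(n - N), eq. (87)
    n_I = int((1 + math.sqrt(1 + 8 * max(n - N, 0) / N ** 2)) / 2)
    while n_I > 1 and n_I * (n_I - 1) > 2 * (n - N) / N ** 2:
        n_I -= 1
    # smallest h with sum_{l=0..h} 3**l >= n_I, eq. (86)
    h, total = 0, 1
    while total < n_I:
        h += 1
        total += 3 ** h
    return n_I, h  # the regret bound (88) is then delta(h)

print(direct_bound_depth(n=10_000, N=5))   # a deeper guaranteed depth in low dimension
print(direct_bound_depth(n=10_000, N=20))  # the same budget guarantees far less depth
```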

4.2.4 Analysis of MCS

The following factors influence the analysis of MCS. First, the nodes considered for expansion are not of the same scale, and hence no statement can be made about the optimality of the expanded nodes. Second, even if MCS is considering near-optimal nodes for expansion, the heuristics (19) and (20) may not hold, which results in moving such nodes into groups of different pseudo-depths. Third, the fact that two nodes of consecutive depths h and \(h+1\) may have the same size makes Assumption 2 more associated with the node's first pseudo-depth value rather than its depth h (i.e., for a node \((h,i)\) whose s upon creation is \(s_{h,i}\), we have \(\sup _{x,y \in {\mathcal {X}}_{h,i}}\ell (x,y)\le \delta (s_{h,i})\) rather than \(\delta (h)\)).

The b-value of a node in MCS is its local score, with \(l_{h,i}=f(x_{h,i})\), \(\lambda =0\), and \(g_{h,i}=-s_{h,i}\). The global score is used to group the nodes considered for expansion at an iteration. Assume that all the nodes keep their initial pseudo-depth; given that the optimal node at a pseudo-depth \(s-1\) has been expanded, its optimal child node \((\acute{h},i^*)\) may have a pseudo-depth of \(s+1\), in which case no statement can be made about the nodes with a pseudo-depth of s. Nonetheless, if \((\acute{h},i^*)\) is at pseudo-depth s, then for any node \((h,i)\) of the same pseudo-depth to be expanded before \((\acute{h},i^*)\), the following must hold (assuming either of the heuristics (19) or (20) holds as well):

$$\begin{aligned} b_{(h,i)}\ge & {} b_{(\acute{h},i^*)} \end{aligned}$$
(89)
$$\begin{aligned} f(x_{h,i}) + \delta (s)\ge & {} f^* \end{aligned}$$
(90)

In such a case, MCS expands \(\delta (s)\)-optimal nodes (with regard to MCS, we refer to a node that satisfies (90), where s is the node's initial pseudo-depth, as a \(\delta (s)\)-optimal node, and denote the set of such nodes by \({\mathcal {I}}^{\delta (s)}\)); otherwise, it may expand non-\(\delta (s)\)-optimal nodes. Clearly, the analysis is complicated. However, we can simplify it with an assumption about the structure of the maximal expected gain \(\max _{1\le j \le N} \hat{e}^{(h,i)}_{j}\) for \(\delta (s)\)-optimal nodes. With this assumption, the relationship between \(h^*_n\) and n can be established. This is demonstrated in the next lemma.

Lemma 5

In MCS, for any pseudo-depth \(0\le s< s_{max}\), whenever \(n \ge \sum _{l=0}^{s}|{\mathcal {I}}^{\delta (l)}|\,(s_{max} -1)\) and \(\max _{1\le j \le N} \hat{e}^{(h,i)}_{j}\ge \delta (s)\) for every \(\delta (s)\)-optimal node \((h,i)\in {\mathcal {T}}\), we have \(h^*_n\ge s\), where n is the number of expansions.

Proof

We know that \(h^*_n\ge 0\) and hence the above statement holds for \(s=0\). For \(s\ge 1\), we are going to prove it by induction.

Assume that the statement holds for \(0\le s\le \hat{s}\). We prove it for \(s=\hat{s}+1\). Let \(n \ge \sum _{l=0}^{\hat{s}+1}|{\mathcal {I}}^{\delta (l)}|(s_{max}-1)\). Consequently, we know that \(n \ge \sum _{l=0}^{\hat{s}}|{\mathcal {I}}^{\delta (l)}|(s_{max}-1)\), for which \(h^*_n\ge \hat{s}\). Here, we have two cases: \(h^*_n\ge \hat{s}+1\), for which the proof is done; or \(h^*_n=\hat{s}\). In the second case, any node \((h,i)\) at pseudo-depth \(\hat{s}+1\) that is expanded before the optimal node of the same pseudo-depth has to be \(\delta (\hat{s}+1)\)-optimal. However, the heuristics of MCS may cause the expansion of such nodes to be skipped while non-\(\delta (s)\)-optimal nodes at higher pseudo-depths are expanded. Nonetheless, if the computed expected gain for a \(\delta (\hat{s}+1)\)-optimal node \((h,i)\) satisfies \(\max _{1\le j \le N} \hat{e}^{(h,i)}_{j}\ge \delta (\hat{s}+1)\), then we are certain that heuristic (20) will always hold for such nodes. This can be proved as follows. Let \((h,i)\) be a \(\delta (\hat{s}+1)\)-optimal node; we have:

$$\begin{aligned} f(x_{h,i})+\delta (\hat{s}+1)\ge & {} f^* \end{aligned}$$
(91)
$$\begin{aligned} f(x_{h,i}) + \max _{1\le j \le N} \hat{e}^{(h,i)}_{j}\ge & {} f_{max} \end{aligned}$$
(92)

Thus, (20) holds for \(\delta (\hat{s}+1)\)-optimal nodes and they will always get expanded. Since there are \(|{\mathcal {I}}^{\delta (\hat{s}+1)}|\) of such nodes, we are certain that the optimal node at pseudo-depth \(\hat{s}+1\) is expanded after at most \(|{\mathcal {I}}^{\delta (\hat{s}+1)}|(s_{max}-1)\) expansions. Thus, \(h^*_n\ge \hat{s}+1\). \(\square \)

The next theorem builds on Lemma 5 to present a finite-time analysis of MCS with an assumption on the structure of a node's gain. To the best of our knowledge, there is no other finite-time analysis of MCS with or without local search (only the consistency property \(\lim _{n\rightarrow \infty }r(n)=0\) for MCS without local search in [27]).

Theorem 5

(Convergence of MCS) Assuming \(\max _{1\le j \le N} \hat{e}^{(h,i)}_{j}\ge \delta (s)\) for every \(\delta (s)\)-optimal node \(\in {\mathcal {T}}\) with \(0\le s \le s_{max}-1\), let s(n) be the smallest \(s\in {\mathbb {Z}}^+_0\) such that

$$\begin{aligned} C(s_{max}-1)\sum _{l=0}^{s(n)} \rho ^{-ld_v}\ge n \end{aligned}$$
(93)

where \(d_v\) is the v-near-optimality dimension. Then the regret of MCS without local search is bounded as

$$\begin{aligned} r(n)\le \delta (\min (s(n),s_{max})) \end{aligned}$$

Proof

From Lemma 1, and the definition of s(n) (93), we have \(|{\mathcal {I}}^{\delta (s)}|\le C\rho ^{-sd_{v}}\) and:

$$\begin{aligned} \sum _{l=0}^{s(n)-1}|{\mathcal {I}}^{\delta (l)}|(s_{max}-1)\le C(s_{max}-1)\sum _{l=0}^{s(n)-1} \rho ^{-ld_v}< n \end{aligned}$$
(94)

which implies, from Lemma 5 and \(\texttt {depth}({\mathcal {T}})\le s_{max}\), that \(h^*_n\ge \min (s(n)-1,s_{max}-1)\). Thus, from (30), we have \(r(n)\le \delta (\min (s(n),s_{max}))\). \(\square \)
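A short sketch of computing s(n) from (93), with assumed constants; the regret bound of Theorem 5 is then \(\delta (\min (s(n),s_{max}))\).

```python
def s_of_n(n, C, rho, d_v, s_max):
    # smallest s with C*(s_max - 1)*sum_{l=0..s} rho**(-l*d_v) >= n, eq. (93)
    s, total = 0, C * (s_max - 1)
    while total < n:
        s += 1
        total += C * (s_max - 1) * rho ** (-s * d_v)
    return s

print(s_of_n(n=1000, C=4.0, rho=0.5, d_v=0.5, s_max=10))  # illustrative inputs
```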

4.2.5 Analysis of SOO

The b-value of a node in SOO is its local score, with \(l_{h,i}=f(x_{h,i}), \lambda =0\), and \(g_{h,i}=-h\). Similar to MCS, the global score is used as a heuristic to filter the nodes considered for expansion at an iteration. Given that the optimal node at depth \(\acute{h}-1\) has been expanded, for any node \((\acute{h},i)\) to be expanded before the optimal node \((\acute{h},i^*)\), the following must hold:

$$\begin{aligned}&b_{(\acute{h},i)} \ge b_{(\acute{h},i^*)} \end{aligned}$$
(95)
$$\begin{aligned}&f(x_{\acute{h},i}) + \delta (\acute{h}) \ge f^* \end{aligned}$$
(96)

For such depths, SOO may expand \(\delta (h)\)-optimal nodes. However, in contrast to DOO, SOO may also expand non-\(\delta (h)\)-optimal nodes at depths \(\acute{h}<h\le \texttt {depth}({\mathcal {T}})\). Nevertheless, the relationship between \(h^*_n\) and n can be established thanks to SOO's strategy in sweeping \({\mathcal {T}}\). This is demonstrated in [38, Lemma 2]. With respect to the theoretical framework, [38] provides the analysis of SOO for \(\delta (h)<1\). We provide a generalized analysis (including \(\delta (h)\ge 1\)) by modifying [38, Theorem 2]:

Theorem 6

(Convergence of SOO) Let h(n) be the smallest \(h\in {\mathbb {Z}}^+_0\) such that

$$\begin{aligned} Ch_{max}(n)\sum _{l=0}^{h(n)} \rho ^{-ld_v}\ge n \end{aligned}$$
(97)

where \(d_v\) is the v-near-optimality dimension and n is the number of expansions. Then the regret of SOO is bounded as

$$\begin{aligned} r(n)\le \delta (\min (h(n),h_{max}+1)) \end{aligned}$$

Proof

In line with Lemma 1, we have \(|{\mathcal {I}}^{\delta (h)}_h|\le C\rho ^{-hd_{v}}\). Thus, from the definition of h(n) (97) and [38, Lemma 2], the following:

$$\begin{aligned} h_{max} \sum _{l=0}^{h(n)-1} |{\mathcal {I}}^{\delta (l)}_l| \le Ch_{max}(n)\sum _{l=0}^{h(n)-1} \rho ^{-ld_v}< n \end{aligned}$$

implies that \(h^*_n\ge \min (h(n)-1,h_{max}(n))\). Thus, from (30), we have \(r(n)\le \delta (\min (h(n),h_{max}+1))\). \(\square \)
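For concreteness, the following sketch implements one SOO iteration batch (the sweep over depths); the interface is assumed for illustration: leaves_by_depth maps each depth to its leaf cells, each carrying a cached base-point value fx, and expand(cell) returns the children at the next depth.

```python
def soo_batch(expand, leaves_by_depth, h_max):
    v_max = -float('inf')
    for h in sorted(leaves_by_depth):
        if h > h_max or not leaves_by_depth[h]:
            continue
        # the b-value at depth h is the local score f(x_{h,i}) alone
        best = max(leaves_by_depth[h], key=lambda cell: cell.fx)
        if best.fx >= v_max:  # expand only if it beats shallower selections
            v_max = best.fx
            leaves_by_depth[h].remove(best)
            leaves_by_depth.setdefault(h + 1, []).extend(expand(best))
    return v_max
```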

Effect of \(h_{max}(n)\). SOO controls \({\mathcal {T}}\)'s exploration behavior (deep vs. broad) through a single parameter, namely \(h_{max}(n)\). Given a fixed budget of node expansions, a greater \(h_{max}\) reduces the likelihood of expanding an optimal node in \({\mathcal {T}}\), and hence yields a greater regret bound. It is interesting to consider SOO's behaviour when \(h_{max}(n)\) is set to \(\infty \). Although Theorem 6 may imply that, for any n, the regret is bounded as \(r(n)\le \delta (0)\), i.e., by a constant, when \(h_{max}(n)=\infty \), this is not really the case for SOO. When \(h_{max}(n)\) is set to \(\infty \), the regret of SOO is related to the number of iteration batches the algorithm may have for a number of expansions n. The next corollary establishes a regret bound for SOO with \(h_{max}(n)=\infty \). It exploits the fact that for an iteration batch \(I_t\), the number of expansions is less than or equal to \(\texttt {depth}({\mathcal {T}}_{t-1})\), and that after each batch of iterations, \({\mathcal {T}}\)'s depth increases at most by one.

Corollary 1

(Convergence of SOO with \(h_{max}=\infty \)) Let us define h(n) as the smallest \(h \in {\mathbb {Z}}^+_0\) such that:

$$\begin{aligned} C\sum \limits _{l=0}^{h(n)}\rho ^{-ld_{v}} \ge n_I \end{aligned}$$
(98)

where \(n_I\) is the greatest positive integer number such that:

$$\begin{aligned} n_{I}(n_{I}+1) \le 2\cdot n \end{aligned}$$
(99)

where n is the number of expansions. Then the regret of SOO with \(h_{max}=\infty \) is bounded as:

$$\begin{aligned} r(n)\le \delta (h(n)) \end{aligned}$$
(100)
Table 1 Convergence rates of different MSO algorithms. These rates hold provided that Assumptions 1, 2, and 3 are satisfied

Proof

Let \(h^*_n=\acute{h}\) after \(\acute{n}\) complete batches of iterations. Then, each of the next batches expands a \(\delta (\acute{h}+1)\)-optimal node (if any). Since there are \(|{\mathcal {I}}^{\delta (\acute{h}+1)}_{\acute{h}+1}|\le C\rho ^{-(\acute{h}+1)d_{v}}\) such nodes, we know that after at most \(\acute{n}+C\rho ^{-(\acute{h}+1)d_{v}}\) batches, \(h^*_n\ge \acute{h}+1\). Now, for m batches of iterations, SOO can have at most \(\frac{m(m+1)}{2}\) expansions. Thus, from the definition of h(n) and (30), we have \(h^*_n\ge h(n)-1\) and \(r(n)\le \delta (h(n))\), respectively. \(\square \)

5 Discussion

This section presents an outline of the theoretical findings and complements it with empirical results on practical optimization problems. Table 1 summarizes the convergence rates, in terms of the regret bound, for the algorithms discussed. These bounds do not imply a comparative performance, but rather report the worst-case behavior of each algorithm. Each algorithm employs different partitioning rules, for which \(\delta (h)\) can be different. Nevertheless, since LO and DOO are theoretical propositions, one could comment on their comparative performance. Based on Table 1, we can conclude the following:

  1. While LO and DOO assume knowledge of f's smoothness, for a Lipschitz-continuous function DOO is preferable to LO, as the latter's regret bound is double the former's for the same number of expansions (provided that \(d_v=d_{v/2}\)), not to mention that DOO comes with a less restrictive assumption on the function's smoothness. In practice, both algorithms are inapplicable unless some approximations of the function's smoothness are made.

  2. If \(s_{max}=h_{max}+1\), then SOO and MCS without local search, following the same partitioning rule, share the same regret bound, under the assumption \(\max _{1\le j \le N} \hat{e}^{(h,i)}_{j}\ge \delta (s)\) for every \(\delta (s)\)-optimal node \(\in {\mathcal {T}}\).

  3. DIRECT has the most pessimistic regret bound, requiring a number of expansions n that grows quadratically with the number of problem dimensions N.

5.1 A worked example

This section presents a worked example, where the loss measure bounds are computed using symbolic mathematics and compared with the empirical results of the algorithms DIRECT, SOO, and MCS.Footnote 12 Consider the function to be minimized \(f:{\mathbb {R}}^N\rightarrow {\mathbb {R}}\), \(f(x)=||x-x^*||^\alpha _{\infty }\), over the decision space \({\mathcal {X}}=[-1,1]^N\). As these algorithms evaluate their cells at the center, the decreasing sequence \(\delta (h)\) can be defined as \(K^{-3\alpha \lfloor {h/N}\rfloor }\) with a partitioning factor of \(K=3\) and \(v=1/2\). As shown in Fig. 4, the bounds of SOO and MCS tightly follow their empirical regrets. On the other hand, DIRECT's pessimistic behavior makes its loss bound lag behind.
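A sketch of the example's ingredients in Python, mirroring the definitions above; the helper names are illustrative and no partitioning logic is implied.

```python
import numpy as np

def f(x, x_star, alpha):
    # the worked example's objective (to be minimized): ||x - x*||_inf^alpha
    return np.max(np.abs(np.asarray(x) - np.asarray(x_star))) ** alpha

def delta(h, N, alpha, K=3):
    # the decreasing sequence from the text, with partitioning factor K = 3
    return K ** (-3 * alpha * (h // N))

x_star = np.array([0.3, -0.7])
print(f([0.0, 0.0], x_star, alpha=2.0))          # 0.7**2 = 0.49
print([delta(h, N=2, alpha=2.0) for h in range(6)])
```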

Fig. 4 The empirical convergence rate and its theoretical bound with respect to the number of function evaluations #f-evals for the algorithms DIRECT, SOO, and MCS in minimizing \(||x-x^*||^\alpha _\infty \)

Fig. 5 Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for all functions and subgroups in 5-D. The targets \(f_{target}=f^*+\triangle f\) are defined with \(\triangle f\in 10^{[-8..2]}\) such that the bestGECCO2009 artificial algorithm just failed to reach them within a given budget of \(k \cdot DIM\), with \(k \in \{0.5,1.2,3,10,50\}\). The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each selected target. The cross indicates the maximum number of function evaluations used by the corresponding algorithm

Fig. 6 Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for all functions and subgroups in 20-D. The targets \(f_{target}=f^*+\triangle f\) are defined with \(\triangle f\in 10^{[-8..2]}\) such that the bestGECCO2009 artificial algorithm just failed to reach them within a given budget of \(k \cdot DIM\), with \(k \in \{0.5,1.2,3,10,50\}\). The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each selected target. The cross indicates the maximum number of function evaluations used by the corresponding algorithm

5.2 Empirical validation on real-world problems

Several studies have been conducted to evaluate the performance of MSO algorithms in the literature. To complement the theoretical perspective of the paper, a compilation of numerical assessments of the algorithms DIRECT, SOO, and MCS is presented in Figs. 5 and 6, based on the experiments in [11, 44, 45], using the Comparing Continuous Optimizers (COCO) methodology [22], which comes with a testbed of 24 scalable noiseless functions [23] addressing such real-world difficulties as ill-conditioning, multi-modality, and dimensionality. A set of target function values is specified per function. The algorithms are evaluated based on the number of function evaluations required to reach a target. The Expected Running Time (ERT), used for the assessment in Figs. 5 and 6, depends on a given target function value, \(f_{target}=f^*+\triangle f\), and is computed over all relevant trials as the number of function evaluations #f-evals executed during each trial while the best function value did not reach \(f_{target}\) (see [22] for more details).

One can see how the pessimistic regret bound of DIRECT is in line with its performance as dimensionality increases: solving around 70% (20%) of the problems for \(N=5\) (\(N=20\)). Sharing a similar regret bound under certain assumptions, MCS and SOO seem to have comparable performance. However, as MCS employs heuristic rules for expansion, it enjoys a faster convergence rate, especially in higher dimensions.

6 Conclusions

This paper has consolidated a broad category of algorithms that search for the (or one) optimal solution \(x^*\) by partitioning the search space \({\mathcal {X}}\) over multiple scales for solving bound-constrained black-box global optimization problems, under the multi-scale optimization (MSO) framework. In line with MSO, we presented a theoretical methodology to analyze these algorithms based on three basic assumptions: (a) local Hölder continuity of the objective function f, (b) bounded cells, and (c) cells sphericity. As a result, we were able to provide theoretical bounds on the regret of several state-of-the-art MSO algorithms, including LO, DOO, DIRECT, MCS, and SOO.