
12.1 Introduction

In this chapter, we address the problem of finding the values of a set of design parameters that attain the optimum of an objective function, written in the following general form:

$$\displaystyle{ x^{{\ast}}\in \mathop{\mathrm{ arg\,max}}_{ x\in \varTheta }h(x), }$$
(12.1)

where Θ is the feasible region, which is often a non-empty compact subset of \(\mathbb{R}^{d}\), and \(h:\varTheta \rightarrow \mathbb{R}\) is a bounded, deterministic objective function. Stochastic optimization often refers to the case where the objective function h itself takes the form of an expectation

$$\displaystyle{h(x) = \mathsf{E}[H(x,\xi )],}$$

where ξ is a random variable representing the stochastic uncertainty of the system, which for example could be a sample path, and only estimates of the “noisy” sample performance H are available. In this chapter, simulation optimization refers to the special case of stochastic optimization when the sample performance H is assessed in a path-wise manner through computer simulation. In contrast with deterministic optimization, simulation optimization problems are characterized by random uncertainties in their performance measures, and the objective functions are often highly nonlinear with respect to the underlying decision variables.

When the objective function is differentiable, a well-known class of methods for solving simulation optimization problems is stochastic approximation [4, 5, 28, 29], the topic of Chaps. 6 and 7. These methods mimic the classical gradient algorithms in deterministic optimization and rely on the estimation of the gradient of the objective function. Because they are gradient-based, these methods generally find local optimal solutions. Sample average approximation [14, 27] is another approach that often exploits structural information such as differentiability, linearity or convexity. The main idea of this approach is to transform a simulation optimization problem into a deterministic one by expending a large amount of simulation effort on each visited solution to obtain a precise estimate of the objective function value. The resulting deterministic counterpart of the original stochastic problem is then solved by a deterministic optimization algorithm.

Due to the limited structural knowledge for general simulation optimization problems, it is natural to adapt random search methods from deterministic optimization to these types of problems. A random search method is usually recursive and approximates the optimal solution by a sequence of iterates (e.g., candidate solutions, promising subsets, probability models) generated according to a specified random mechanism. These methods differ primarily in the type of iterates an algorithm produces and in the choices of the random strategy used to generate the iterates. Because random search methods typically only rely on the objective function values rather than structural information such as convexity and differentiability, they are robust, easy to implement, and can be applied to a broad range of optimization problems with very different characteristics.

From an algorithmic point of view, a random search algorithm can be broadly classified as being either instance-based or model-based [44]. In instance-based algorithms, an iterate corresponds to a single candidate solution or a subset of candidate solutions, and the construction of new iterates depends explicitly on iterates generated in previous iterations. These include both population-based algorithms such as genetic algorithms [10], which produce a collection of candidate solutions at each iteration, and methods like nested partitions [36] that are based on repeatedly identifying a promising subset of the feasible region as the search proceeds. Currently, random search-based simulation optimization is primarily dominated by instance-based methods, with numerous algorithms proposed in the literature and their behaviors relatively well studied and understood. In addition to the two aforementioned methods, some typical examples include response surface methods [3], simulated annealing [26], tabu search [9], stochastic ruler methods [38], stochastic comparison [11], and the COMPASS algorithm [15]; see Chaps. 2, 10, and 11, as well as [2] for a review of this class of methods.

While the field of simulation optimization has significantly evolved in terms of instance-based algorithms, less attention has been devoted to the study of model-based methods. Unlike instance-based algorithms, the model-based methods are based on sampling candidate solutions from an intermediate (usually parameterized) probability distribution over the feasible region. The idea is to iteratively modify the distribution model based on the sampled solutions to bias the search towards regions containing high quality solutions. Therefore, each iterate in a model-based algorithm corresponds to a distribution function, which can be abstractly viewed as a “model” characterizing the promising regions of the solution space. Examples of this type of method include ant colony optimization (ACO) [8], estimation of distribution algorithms (EDAs) [30], annealing adaptive search (AAS) [32, 39], probability collectives (PCs) [37], and the cross-entropy (CE) method [35]. These algorithms have successfully solved some hard nonlinear, non-differentiable problems and are becoming increasingly prominent in deterministic optimization.

Unfortunately, because model-based algorithms are generally heuristic in nature, very few of them have found their way into the simulation optimization literature. In this chapter, we aim to stimulate new research ideas in this area by presenting some recently developed frameworks and approaches for the design and analysis of a class of model-based algorithms. As a starting point, we introduce these developments in the context of deterministic optimization with the objective function h being viewed as a “black box” that returns the exact function value for a specified solution. We then proceed by providing specific examples and algorithms to illustrate the key modifications needed, as well as issues and challenges involved in extending model-based algorithms to general simulation optimization settings.

12.2 A Brief Review of Model-Based Methods

The basic idea of model-based algorithms is to use a sequence of probability distribution functions to successively characterize the promising regions of the solution space. So in a model-based algorithm, it is the probability distribution rather than candidate solutions (as in an instance-based algorithm) that is propagated from one iteration to another. Most algorithms that fall into this category are iterative methods involving the basic steps of the framework below.

Basic Model-Based Optimization Framework

Step 1.:

Randomly generate a population of candidate solutions X (k) from g k , where g k is the probability distribution on Θ at the kth iteration.

Step 2.:

Evaluate/estimate the performance of generated candidate solutions.

Step 3.:

Update g k based on the performance of the sampled solutions in X (k) to construct a new distribution g k+1; increase k by 1 and reiterate from Step 1.

Note that since a population of candidate solutions is generated at each step, such algorithms retain the primary strengths of population-based approaches such as genetic algorithms, while providing more flexibility and robustness in exploring the entire feasible region (i.e., via sampling from g k ). Clearly, a major algorithmic question in model-based algorithms is how the update in Step 3 is carried out. Important practical implementation issues are the efficient construction/representation of the probability distribution g k and efficient sampling from g k over the feasible region Θ. How these issues are addressed is what differentiates particular approaches.
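
To make the three steps concrete, the following minimal Python sketch instantiates the framework with an independent Gaussian distribution playing the role of g k and a simple elite-refitting rule at Step 3; the toy objective, the elite fraction, and all numerical settings are illustrative assumptions rather than part of any specific algorithm reviewed in this chapter.

```python
import numpy as np

def h(x):
    # toy black-box objective to be maximized; optimizer is x* = 0
    return -np.sum(x**2)

def model_based_search(dim=2, N=50, iters=100, elite_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), 5.0 * np.ones(dim)     # parameters of g_k
    for k in range(iters):
        # Step 1: generate a population of candidate solutions from g_k
        X = rng.normal(mu, sigma, size=(N, dim))
        # Step 2: evaluate the performance of the generated solutions
        scores = np.array([h(x) for x in X])
        # Step 3: update g_k based on the sampled solutions; here we simply
        # refit the Gaussian to the elite samples (one possible update rule)
        elite = X[np.argsort(scores)[-int(elite_frac * N):]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-8
    return mu

print(model_based_search())   # should approach the optimizer x* = 0
```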

Pure random search [39] (see also Chap. 11) can be viewed as one of the simplest model-based methods, in the sense that g k is taken to be a fixed uniform distribution over Θ (not updated at all). The algorithm proceeds by generating a sequence of uniformly distributed random points over the feasible region and using the best candidate solution obtained thus far as an estimate of the optimal solution. Although the idea behind the algorithm is simple, the complexity of the algorithm increases exponentially with the dimension of the solution space.
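
Pure random search fits the same template with g k never updated; a short sketch over a box-constrained feasible region (the bounds and objective below are assumed purely for illustration) follows.

```python
import numpy as np

def pure_random_search(h, lower, upper, budget=10_000, seed=0):
    """Sample uniformly over the box [lower, upper] and keep the best point seen."""
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for _ in range(budget):
        x = rng.uniform(lower, upper)      # g_k is fixed: uniform over Theta
        val = h(x)
        if val > best_val:                 # retain the best solution found so far
            best_x, best_val = x, val
    return best_x, best_val

# example: maximize h(x) = -||x||^2 over the box [-5, 5]^2
print(pure_random_search(lambda x: -np.sum(x**2), np.full(2, -5.0), np.full(2, 5.0)))
```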

A nontrivial improvement of pure random search is the annealing adaptive search (AAS) algorithm [32, 39], described in detail in Chap. 11, which replaces the fixed uniform distribution in a pure random search method by a sequence of Boltzmann distributions parameterized by iteration-varying temperatures T k . These Boltzmann distributions are constructed in such a way that as the temperature decreases to zero, the sequence of distributions will become more concentrated on the set of optimal solutions. So solutions sampled from Boltzmann distributions with small values of T k will be close to the optimum with high probability. For the class of Lipschitz optimization problems, it has been shown that the expected number of iterations required by AAS to achieve a given level of precision increases at most linearly in the problem dimension [32, 39]. However, the idealized AAS is not readily implementable in practice for solving optimization problems, because the problem of sampling exactly from a given Boltzmann distribution is known to be very difficult. This implementation issue has been addressed in a number of papers (see e.g., [39, 40] and the references therein), and the basic idea is to use Markov chain Monte Carlo techniques to sample asymptotically from the Boltzmann distribution.

The cross-entropy (CE) method [35] was originally motivated by the problem of estimating probabilities of rare events in simulation [33], before it was discovered that it could be modified to solve deterministic optimization problems. The key idea of CE is to use a family of parameterized distributions to successively approximate an optimal (importance sampling) distribution concentrated only on the set of (near) optimal solutions, which is carried out by iteratively estimating the optimal parameter that minimizes the Kullback–Leibler (KL) divergence between the parameterized distribution and the target optimal distribution. Since its introduction, there have been extensive developments regarding implementation and practical applications of CE (see [35]). Those that are particularly relevant to our discussion include the adaptation of CE to handle stochastic network combinatorial optimization problems [34], the application of the method to solving buffer allocation problems in a simulation-based environment [1], and the work of [31], which uses CE as a direct policy search approach to solving stochastic dynamic programming problems. The literature analyzing the convergence properties of the CE method is relatively sparse, with most of the existing results limited to specific settings; see, e.g., [13] for a convergence proof of a variational version of CE in the context of estimation of rare event probabilities, and [7] for probability one convergence proofs of CE for discrete optimization problems. General convergence and asymptotic rate results for CE were recently obtained in [20, 22] by relating the algorithm to recursions of stochastic approximation type (see Sect. 12.4).

Two other well-established model-based methods are the ant colony optimization (ACO) [8] and the estimation of distribution algorithms (EDAs) [30]. ACO was inspired by the behavior of a colony of biological ants, which are capable of solving shortest path problems by exchanging their local information indirectly through a certain chemical substance called pheromone. ACO is frequently applied to solving combinatorial problems, e.g., the traveling salesman problem. In such problems, the generation of candidate solutions (tours) is based on a series of random moves performed by a collection of artificial ants called agents, which are controlled by an empirical distribution constructed based on each agent’s local experience. ACO has been formally extended to stochastic settings for solving stochastic combinatorial optimization problems. One such extension is called the stochastic ant colony optimization (S-ACO) [12], which uses Monte Carlo sampling to estimate the expectation involved in evaluating the objective function. The probability one convergence of S-ACO to the global optimal solution has been established in [12].

EDAs inherit the spirit of genetic algorithms (GAs), but eliminate the crossover and mutation operators to avoid the disruption of partial solutions. In EDAs, a new population of candidate solutions is generated according to the probability distribution induced or estimated from the promising solutions selected from the previous generation. Unlike CE, EDAs often take into account the interrelations between the underlying decision variables needed to represent the individual candidate solutions. At each iteration of the algorithm, a high-dimensional probabilistic model that better represents the interdependencies between the decision variables is induced; this step constitutes the most crucial and difficult part of the method. We refer the reader to [30] for a review of the way in which different probabilistic models are used as EDA instantiations. A proof of convergence of a class of EDAs, under the idealized infinite population assumption, can be found in [41].

There are many other model-based algorithms proposed for optimization. Some notable examples include probability collectives (PCs) [37], particle swarm optimization [25], the particle filtering approach [42], and gradient-guided stochastic search [43]. A complete description of all of them is outside the scope of this chapter. Instead, we will focus our discussions on some recently developed approaches that allow us to arrive at a systematic framework to study a class of model-based algorithms in a uniform manner. We review specific algorithms under the framework, present their asymptotic convergence properties, and discuss their adaptations to simulation optimization.

12.3 A Model Reference Optimization Framework

We begin with a description of the model reference adaptive search (MRAS) method introduced by Hu et al. [18], which will serve as a basis for subsequent discussions.

Model-based methods construct a sequence of distributions {g k } with some desired convergence properties (ideally including \(g_{k} \rightarrow g\) as \(k \rightarrow \infty \), with g being a limiting distribution assigning all of its probability mass to the set of optimal solutions). Some common approaches for constructing such a sequence include: (a) various proportional selection schemes—used in ACOs, EDAs, and PCs; (b) Boltzmann selection schemes—used in AAS and the continuous EDA algorithm in [6]; and (c) optimal importance sampling measure—primarily used in the CE method.

However, in all cases, the obvious difficulties are that the sequence {g k } often depends on h, which may not be available in any explicit form, and that the problem of sampling exactly from even a known (but arbitrary) distribution g k is in general intractable. In [18], a general approach called model reference adaptive search (MRAS) is proposed, where these difficulties are circumvented by sampling from a surrogate distribution that approximates g k . The idea of MRAS is to specify a family of parameterized distributions \(\{f_{\varphi }(\cdot ),\varphi \in \varPhi \}\) (with Φ being the parameter space) and then project g k onto the family to obtain a sequence of sampling distributions \(\{f_{\varphi _{k}}\}\) with desired convergence properties. The projection is carried out by finding an optimal parameter \(\varphi _{k}\) that minimizes the Kullback–Leibler (KL) divergence between g k and the parameterized family \(\{f_{\varphi }(\cdot ),\varphi \in \varPhi \}\) (cf. also [35]), i.e.,

$$\displaystyle{\varphi _{k} =\mathop{ \mathrm{ arg\,min}}_{\varphi \in \varPhi }\mathcal{D}(g_{k},f_{\varphi }),}$$

where

$$\displaystyle{ \mathcal{D}(g_{k},f_{\varphi }):=\int _{\varTheta }\ln \frac{g_{k}(x)} {f_{\varphi }(x)} g_{k}(x)\nu (dx) = \mathsf{E}_{g_{k}}\bigg[\ln \frac{g_{k}(X)} {f_{\varphi }(X)} \bigg], }$$
(12.2)

ν is the Lebesgue/discrete measure on Θ, and \(\mathsf{E}_{g}[\cdot ]\) is the expectation taken with respect to the density/mass function g. The hope is that the parameterized family is specified with some structure so that once the parameter is determined, sampling from each of these distributions should be a relatively easy task. Moreover, the task of updating the entire distribution can be simplified to the task of updating its associated parameters, and the sequence {g k }, henceforth referred to as reference distributions, is only used implicitly to guide the parameter updating procedure.

Note that there are other ways of constructing surrogate distributions. For example, in AAS, Markov chain Monte Carlo (MCMC) techniques are frequently used to approximate the target Boltzmann distribution at each iteration [32], and in traditional EDAs, empirical distribution models are directly constructed to generate new candidate solutions [30, 41]. However, these algorithms can also be accommodated within the above model reference framework by projecting the target distributions onto a family of parameterized distributions. Specifically, under such a framework, the three example sequences of reference distributions {g k } can be expressed in the following recursive forms:

  (a)

    proportional selection scheme:  \(g_{k}(x) = \frac{S(h(x))g_{k-1}(x)} {\mathsf{E}_{g_{k-1}}[S(h(X))]};\)

  (b)

    Boltzmann selection:

    $$\displaystyle{g_{k}(x) = \frac{e^{h(x)/T_{k}}} {\int _{x\in \varTheta }e^{h(x)/T_{k}}\nu (dx)} = \frac{e^{h(x)( \frac{1} {T_{k}}- \frac{1} {T_{k-1}} )}g_{ k-1}(x)} {\mathsf{E}_{g_{k-1}}\big[e^{h(X)( \frac{1} {T_{k}}- \frac{1} {T_{k-1}} )}\big]}:= \frac{S(h(x))g_{k-1}(x)} {\mathsf{E}_{g_{k-1}}[S(h(X))]},}$$

    with {T k } being a sequence of parameters determined by an annealing schedule;

  (c)

    importance sampling measure:  \(g_{k}(x) = \frac{S(h(x))f_{\varphi _{k}}(x)} {\mathsf{E}_{\varphi _{k}}[S(h(X))]},\)

where S(⋅ ) is a non-negative increasing (possibly iteration-varying) function, and \(\mathsf{E}_{\varphi }[\cdot ]\) is the expectation taken with respect to \(f_{\varphi }\). The algorithm instantiation considered in [18] uses the recursive procedure corresponding to case (a) to construct the g k sequence. This form of reference distribution weights g k by the value of the performance measure, thereby giving more mass to solutions with good performance. The resulting g k+1 has the property that it improves the expected performance of g k , since

$$\displaystyle{\mathsf{E}_{g_{k+1}}[S(h(X))] = \frac{\mathsf{E}_{g_{k}}[S^{2}(h(X))]} {\mathsf{E}_{g_{k}}[S(h(X))]} \geq \mathsf{E}_{g_{k}}[S(h(X))].}$$

The inequality holds because \(\mathsf{E}_{g_{k}}[S^{2}(h(X))] \geq (\mathsf{E}_{g_{k}}[S(h(X))])^{2}\). This monotone improvement property ensures the convergence of the sequence {g k } to a degenerate distribution concentrated on the set of optimal solutions. The general structure of the MRAS method is outlined below.

Basic Model Reference Adaptive Search (MRAS) Optimization Framework

Step 0.:

Select a parameterized family \(\{f_{\varphi }\}\) and the {g k } sequence with desired convergence properties.

Step 1.:

Given \(\varphi _{k}\), generate N candidate solutions \(X_{k}^{1},\ldots,X_{k}^{N}\) by sampling from \(f_{\varphi _{k}}\).

Step 2.:

Update the parameter \(\varphi _{k+1}\) based on the sampled solutions by minimizing the KL-divergence

$$\displaystyle{\varphi _{k+1} =\mathop{ \mathrm{ arg\,min}}_{\varphi \in \varPhi }\mathcal{D}(g_{k+1},f_{\varphi });}$$

set k ← k + 1 and reiterate from Step 1 until a stopping criterion is satisfied.

In some model-based algorithms such as CE and EDAs, there is often an additional selection step embedded in the above procedure. The idea is to concentrate the computational effort on the set of promising solutions by using only a portion of the samples—the set of “elite” samples—to update the probability model at Step 2. Also note that in general, the expectation involved in the KL-divergence (cf. (12.2)) cannot be evaluated exactly. So in practice, the objective function in the minimization problem at Step 2 is often replaced by its sample average approximation. This leads to an estimator of \(\varphi _{k+1}\) that is biased for any finite sample size N. This bias issue will be discussed later in Sect. 12.5.

12.3.1 Convergence Result

For distributions in the natural exponential family (NEF), the minimization in Step 2 of the basic MRAS framework can be carried out in analytical form, which makes the method easy to implement efficiently.

Definition 12.1.

A parameterized family \(\{f_{\varphi }(\cdot ),\varphi \in \varPhi \subseteq \mathbb{R}^{m}\}\) on Θ is called a natural exponential family if there exist mappings \(\varGamma: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}\) and \(K: \mathbb{R}^{m} \rightarrow \mathbb{R}\) such that \(f_{\varphi }(x) =\exp \big (\varphi ^{T}\varGamma (x) - K(\varphi )\big)\), where \(K(\varphi ) =\ln \int _{\varTheta }\exp (\varphi ^{T}\varGamma (x))\nu (dx)\) is the log-partition function, i.e., the logarithm of the normalization constant.

The NEF has many interesting properties. For the purpose of our discussion, we recall that \(K(\varphi )\) is strictly convex in the interior of Φ and the mean parameter function defined by

$$\displaystyle{ m(\varphi ):= \nabla _{\varphi }K(\varphi ) = \mathsf{E}_{\varphi }[\varGamma (X)] }$$
(12.3)

is a one-to-one invertible mapping of \(\varphi\). Intuitively, \(m(\varphi )\) is essentially a transformed version of the sufficient statistic \(\varGamma (x)\), whose value contains all information necessary in estimating the parameter \(\varphi\). For example, in the univariate normal distribution \(\mathcal{N}(\mu,\sigma ^{2})\) with mean μ and variance \(\sigma ^{2}\), it can be seen that \(\varGamma (x) = (x,x^{2})^{T}\) and \(\varphi = (\mu /\sigma ^{2},-1/(2\sigma ^{2}))^{T}\). Thus, given the value of \(m(\varphi )\), the equation \(m(\varphi ) = \mathsf{E}_{\varphi }[\varGamma (X)] = (\mu,\sigma ^{2} +\mu ^{2})^{T}\) can be uniquely solved for μ and \(\sigma ^{2}\).
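
To see why the projection at Step 2 is analytically tractable for NEFs, note that minimizing \(\mathcal{D}(g,f_{\varphi })\) over an NEF reduces, by the first-order condition and the strict convexity of \(K(\varphi )\), to the moment-matching condition \(m(\varphi ) = \mathsf{E}_{g}[\varGamma (X)]\). The sketch below carries out the sample-average version of this projection for the univariate normal family; the particular weights, which stand in for a reference distribution evaluated at the sampled solutions, are an illustrative assumption.

```python
import numpy as np

def project_to_normal(samples, weights):
    """KL projection of a weighted empirical distribution onto the univariate normal
    family: solve m(phi) = E[Gamma(X)] with Gamma(x) = (x, x^2)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize the weights
    x = np.asarray(samples, dtype=float)
    m1, m2 = np.dot(w, x), np.dot(w, x**2)     # weighted E[X] and E[X^2]
    mu = m1
    sigma2 = max(m2 - m1**2, 1e-12)            # sigma^2 = E[X^2] - (E[X])^2
    return mu, sigma2

# toy check: weights proportional to exp(h(x)) with h(x) = -(x - 2)^2
rng = np.random.default_rng(0)
X = rng.normal(0.0, 3.0, size=1000)
print(project_to_normal(X, np.exp(-(X - 2.0)**2)))   # mean moves toward x* = 2
```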

When NEFs are used in the framework with the sample size N adaptively increasing, the convergence result for the instantiation of MRAS considered in [18] takes the form

$$\displaystyle{\lim _{k\rightarrow \infty }m(\varphi _{k}) =\varGamma (x^{{\ast}})\ \ \mbox{ w.p.1.}}$$

Since \(m(\varphi )\) is one-to-one, this shows that the sequence of sampling distributions \(\{f_{\varphi _{k}}\}\) will converge to a limiting distribution \(f_{\varphi ^{{\ast}}}\) for g k sequences of proportional selection scheme type. In addition, it has been argued in [18] that in many special cases of interest, the limiting distribution turns out to be a degenerate distribution on the set of optimal solutions. For example, when multivariate normal distributions with mean vector μ k and covariance matrix Σ k are used as parameterized distributions, the convergence result translates to \(\lim _{k\rightarrow \infty }\mu _{k} = x^{{\ast}}\) and \(\lim _{k\rightarrow \infty }\varSigma _{k} = \mathbf{0}_{n\times n}\) w.p.1, where 0 n×n represents an n-by-n zero matrix.

12.3.2 Simulation Optimization

In [19], the MRAS method has been generalized to stochastic settings where the objective function h(x) in (12.1) can only be estimated, e.g., via a simulation model or real-time observations. We begin by providing a high-level description of the method.

Stochastic MRAS Framework for Simulation Optimization

Step 0.:

Select a parameterized family \(\{f_{\varphi }\}\) and an idealized {g k } sequence with desired convergence properties.

Step 1.:

Given \(\varphi _{k}\), generate N candidate solutions \(\varLambda _{k}:=\{ X_{k}^{1},\ldots,X_{k}^{N}\}\) by sampling from \(f_{\varphi _{k}}\).

Step 2.:

Take M k simulation observations for each x ∈ Λ k and estimate h(x) by the sample average \(\overline{H}(x):= \frac{1} {M_{k}}\sum _{j=1}^{M_{k}}H(x,\xi _{ j})\), where ξ j is the simulation noise in the jth replication run.

Step 3.:

Update the parameter \(\varphi _{k+1}\) based on the sampled solutions by minimizing the KL-divergence

$$\displaystyle{\varphi _{k+1} =\mathop{ \mathrm{ arg\,min}}_{\varphi \in \varPhi }\mathcal{D}(\tilde{g}_{k+1},f_{\varphi });}$$

set k ← k + 1 and reiterate from Step 1 until a stopping criterion is satisfied.

The basic structure of the stochastic MRAS method (SMRAS) is similar to that of MRAS for deterministic optimization, the main addition being the requirement of an additional performance estimation step (i.e., Step 2 above). So in addition to the sample size N used in MRAS, we also need to specify the number of simulation replications to be allocated to each sampled candidate solution. At each iteration, the sample mean based on M k observations is used to estimate the true performance h(x). Another modification from the original MRAS method occurs at Step 3, where a distribution \(\tilde{g}_{k}\) is used as an approximation of the idealized g k distribution in minimizing the KL-divergence. The sequence \(\{\tilde{g}_{k}\}\) is obtained by replacing the true objective function h(x) in the construction of g k with its sample average approximation \(\overline{H}(x)\).

There are two general types of approaches for handling the simulation noise involved in evaluating the objective function. One type of approach (e.g., sample average approximation) relies on highly precise estimates of the objective function values, obtained by allocating a significant number of simulation replications to each visited solution. The other type of approach (e.g., stochastic approximation) does not require precise performance estimates, but generally involves some form of averaging, so that the estimation error due to simulation noise automatically cancels out over the course of a large number of iterations. The SMRAS method falls in between these two types of approaches and requires increasingly precise estimates of the objective function values as the search progresses. In particular, for the proportional selection scheme type {g k } sequence, conditions on {M k } that ensure the convergence of the method are provided in [19]. It has been shown that when the simulation noise terms (not necessarily i.i.d.) satisfy a large deviation principle, M k is required to grow linearly with the number of iterations k.
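
As a concrete illustration of Step 2 of SMRAS, the fragment below estimates h(x) for each sampled solution by averaging M k independent replications, with M k growing linearly in the iteration counter k; the noisy sample performance H and the growth constants are assumptions made for illustration.

```python
import numpy as np

def H(x, rng):
    # assumed noisy sample performance: h(x) = -||x||^2 observed with additive noise
    return -np.sum(x**2) + rng.normal(scale=1.0)

def estimate_performance(solutions, k, rng, m0=10, c=5):
    """Step 2 of SMRAS: average M_k simulation replications per sampled solution."""
    M_k = m0 + c * k                           # allocation growing linearly in k
    return np.array([np.mean([H(x, rng) for _ in range(M_k)]) for x in solutions])

rng = np.random.default_rng(0)
Lambda_k = rng.normal(size=(5, 2))             # candidate solutions sampled from f_{phi_k}
print(estimate_performance(Lambda_k, k=3, rng=rng))   # H_bar(x) for each x in Lambda_k
```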

As compared with the MRAS method for deterministic optimization, the updating formula for \(\varphi _{k+1}\) (or equivalently \(m(\varphi _{k+1})\), due to the invertibility of the mapping m) in SMRAS has an additional simulation error term. For general reference distribution sequences, a large deviation approach similar to that of [19] can be applied to determine the conditions on {M k } under which the error term converges to zero with probability one. Again, when NEFs are used in the SMRAS framework and {g k } is constructed using the proportional selection scheme, essentially the same convergence result as stated in Sect. 12.3.1, i.e.,

$$\displaystyle{\lim _{k\rightarrow \infty }m(\varphi _{k}) =\varGamma (x^{{\ast}})\ \ \mbox{ w.p.1,}}$$

has been established under some appropriate conditions on the algorithm input parameters, the sample size N, and the number of simulation replications M k ; see [19].

12.4 A Stochastic Approximation Framework

In this section, we present a stochastic approximation (cf. Chap. 6) framework to analyze a general class of model-based algorithms. We show that a slight modification of the MRAS method introduced in Sect. 12.3 will lead to an interesting connection between model-based algorithms and the well-known stochastic approximation method. This connection implies that for a general non-differentiable (deterministic) optimization problem, a model-based algorithm implicitly transforms the underlying problem into an equivalent stochastic optimization problem with smooth differentiable structures, and the algorithm itself can be viewed as a gradient-based recursion over a transformed parameter space for solving the equivalent smoothed problem. This interpretation of model-based algorithms not only explains why these algorithms work well for hard optimization problems with little structure, but also allows us to investigate their theoretical properties such as convergence and rate of convergence by using theory and tools from stochastic approximation.

The main idea of the stochastic approximation framework is to replace the reference distribution sequence {g k } in the original MRAS method by a sequence \(\{\hat{g}_{k}\}\) of the form:

$$\displaystyle{ \hat{g}_{k+1}(x) =\alpha _{k}g_{k+1}(x) + (1 -\alpha _{k})f_{\varphi _{k}}(x),\ \ \alpha _{k} \in [0,1]\ \forall \,k, }$$
(12.4)

where \(f_{\varphi _{k}}\) is the sampling distribution obtained at the kth iteration of an algorithm and α k is a mixture coefficient. Note that by taking \(\hat{g}_{k+1}\) as the weighted average of g k+1 and \(f_{\varphi _{k}}\), it is “forced” to stay close to both distributions. Thus, if \(\hat{g}_{k+1}\) is used in place of g k+1 in minimizing the KL-divergence \(\mathcal{D}(\hat{g}_{k+1},f_{\varphi })\) at Step 2 of the MRAS method, the new sampling distribution \(f_{\varphi _{k+1}}\) obtained will not deviate significantly from the current sampling distribution \(f_{\varphi _{k}}\). Since an NEF distribution \(f_{\varphi }\) is uniquely characterized by its associated parameter \(\varphi\), the above intuition can be formally stated in terms of the difference between the two successive mean parameter functions (cf. (12.3)) of the projected probability distributions [20, 22]:

$$\displaystyle{ m(\varphi _{k+1}) - m(\varphi _{k}) = -\alpha _{k}\nabla _{\varphi }\mathcal{D}(g_{k+1},f_{\varphi })\vert _{\varphi =\varphi _{k}}. }$$
(12.5)

Equation (12.5) explicitly brings out the updating direction of the parameter functions at each step, which is in the direction of the negative gradient of the iteration-varying objective function for the minimization problem \(\min _{\varphi \in \varPhi }\mathcal{D}(g_{k+1},f_{\varphi })\ \forall \,k\). This suggests that regardless of the type of decision variables involved in the original problem (12.1), algorithms conforming to the framework are essentially gradient-based recursions for solving a sequence of optimization problems on the parameter space Φ with smooth differentiable structures. In the special case of the CE method, i.e., when g k+1 in the right-hand-side of recursion (12.4) is replaced with \(S(h(x))f_{\varphi _{k}}(x)/\mathsf{E}_{\varphi _{k}}[S(h(X))]\), it can be seen that (12.5) becomes

$$\displaystyle{ m(\varphi _{k+1}) - m(\varphi _{k}) =\alpha _{k}\frac{\mathsf{E}_{\varphi _{k}}[S(h(X))(\varGamma (X) - m(\varphi _{k}))]} {\mathsf{E}_{\varphi _{k}}[S(h(X))]} =\alpha _{k}\nabla _{\varphi }\ln \mathsf{E}_{\varphi }[S(h(X))]\big\vert _{\varphi =\varphi _{k}}. }$$
(12.6)

So the parameter update moves in the direction of the gradient of the objective function for the maximization problem \(\max _{\varphi \in \varPhi }\ln \mathsf{E}_{\varphi }[S(h(X))]\). The optimal solution of this optimization problem is a parameter \(\varphi ^{{\ast}}\) whose associated sampling distribution \(f_{\varphi ^{{\ast}}}\) assigns maximum probability to the set of optimal solutions of (12.1).

Letting \(\eta _{k}:= m(\varphi _{k})\) and using the invertibility of the mapping m, we can write (12.5) in the abstract form

$$\displaystyle{ \eta _{k+1} =\eta _{k} -\alpha _{k}L(\eta _{k}), }$$
(12.7)

where L(η k ) represents the gradient of the underlying (possibly iteration-varying) objective function at η k , which, for the three example {g k } sequences discussed in Sect. 12.3, takes the general form

$$\displaystyle{ L(\eta _{k}) = \frac{\mathsf{E}_{\varphi _{k}}[S(h(X))G(X,\eta _{k})]} {\mathsf{E}_{\varphi _{k}}[S(h(X))]} }$$
(12.8)

for some appropriate function G(x, η k ). In actual implementations, expectations are replaced by sample averages based on Monte Carlo sampling, and (12.7) becomes a stochastic approximation recursion with direct gradient estimation:

$$\displaystyle{ \tilde{\eta }_{k+1} =\tilde{\eta } _{k} -\alpha _{k}\tilde{L}(\tilde{\eta }_{k}), }$$
(12.9)

where \(\tilde{L}\) is an estimator for L based on sampled candidate solutions. The most straightforward estimator is

$$\displaystyle{\tilde{L}(\tilde{\eta }_{k}) = \frac{N^{-1}\sum _{i=1}^{N}S(h(X_{k}^{i}))G(X_{k}^{i},\tilde{\eta }_{k})} {N^{-1}\sum _{i=1}^{N}S(h(X_{k}^{i}))}.}$$

Thus, it is clear that the rich body of tools and results from stochastic approximation [28, 29] can be incorporated into the framework to analyze model-based algorithms.
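
For a univariate normal sampling distribution, the sufficient statistic is \(\varGamma (x) = (x,x^{2})^{T}\) and \(\eta _{k} = m(\varphi _{k}) = (\mu _{k},\sigma _{k}^{2} +\mu _{k}^{2})^{T}\), so recursion (12.9) with the estimator above takes only a few lines of code. The sketch below is one possible instantiation; the objective function, the choice S(y) = e^y, and the constant gain are assumptions made for illustration.

```python
import numpy as np

def h(x):
    return -(x - 2.0)**2                       # toy objective; optimizer x* = 2

def smoothed_ce(iters=200, N=100, alpha=0.2, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma2 = 0.0, 25.0
    eta = np.array([mu, sigma2 + mu**2])       # eta_k = m(phi_k)
    for k in range(iters):
        X = rng.normal(mu, np.sqrt(sigma2), size=N)
        S = np.exp(h(X))                       # S(h(x)); any positive increasing S works
        Gamma = np.stack([X, X**2], axis=1)    # sufficient statistic Gamma(x) = (x, x^2)
        w = S / S.sum()
        # sample analogue of (12.6): the update direction w @ Gamma - eta equals
        # -L_tilde(eta_k), so this is eta_{k+1} = eta_k - alpha * L_tilde(eta_k)
        eta = eta + alpha * (w @ Gamma - eta)
        mu, sigma2 = eta[0], max(eta[1] - eta[0]**2, 1e-12)
    return mu, sigma2

print(smoothed_ce())    # mu should approach 2 and sigma2 should shrink toward 0
```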

12.4.1 Convergence Results

12.4.1.1 Convergence of the CE Method

The convergence of the CE algorithm has been studied in [20, 22] by writing (12.9) in the form of a generalized Robbins–Monro algorithm in terms of the true gradient, a bias term, and a noise term caused by Monte Carlo random sampling, and then following a standard ordinary differential equation (ODE) argument (cf. e.g., [4, 5, 28]). Basically, it has been shown in [22] that the asymptotic behavior of CE is governed by the properties of a limit set of an underlying ODE. In addition, if the limit set consists purely of isolated equilibrium points of the ODE, then the sequence \(\{\tilde{\eta }_{k}\}\) generated by CE will converge to a unique limiting point \(\eta ^{{\ast}}\) w.p.1. Under such a condition, the following asymptotic convergence rate result has also been established in [22]:

$$\displaystyle{k^{ \frac{\tau }{ 2} }(\tilde{\eta }_{k} -\eta ^{{\ast}})\mathop{\longrightarrow}\limits_{}^{\ d\ }\mathcal{N}\big(0,\varSigma \big)\ \ \mbox{ as $k \rightarrow \infty,$}}$$

where τ ∈ (0, 1) is an appropriate constant and Σ is a positive definite covariance matrix.

12.4.1.2 Model-Based Annealing Random Search (MARS)

The stochastic approximation framework also allows for a lot of flexibility in developing provably convergent algorithms that perform well in practice. For example, [21] has investigated the use of Boltzmann distributions as reference models in the framework to address the implementation difficulty of annealing adaptive search, leading to a new globally convergent algorithm called model-based annealing random search (MARS). The algorithm complements existing research based on MCMC sampling techniques in the sense that it samples candidate solutions from a sequence of NEF distributions that approximates the target Boltzmann distributions, whereas MCMC techniques are sequential sampling procedures that sample directly from the Boltzmann distributions. The major steps of the MARS algorithm are given next.

Basic MARS Algorithm

Step 0.:

Select an NEF distribution family \(\{f_{\varphi },\varphi \in \varPhi \}\), a sequence of temperature parameters {T k }, and a gain sequence \(\{\alpha _{k}\}\).

Step 1.:

Given \(\varphi _{k}\), generate N candidate solutions \(X_{k}^{1},\ldots,X_{k}^{N}\) by sampling from \(f_{\varphi _{k}}.\)

Step 2.:

Update the parameter

$$\displaystyle{\varphi _{k+1} =\mathop{ \mathrm{ arg\,min}}_{\varphi \in \varPhi }\mathcal{D}(\hat{g}_{k+1},f_{\varphi });}$$

set k ← k + 1 and reiterate from Step 1.

At Step 2 of MARS, the reference distribution \(\hat{g}_{k+1}\) is given by \(\hat{g}_{k+1}(x) =\alpha _{k}\bar{g}_{k+1}(x) + (1 -\alpha _{k})f_{\varphi _{k}}(x),\) which is the mixture of the current sampling distribution \(f_{\varphi _{k}}\) with \(\bar{g}_{k+1}\), an empirical estimate of the true Boltzmann distribution \(g_{k+1}(x):= \frac{e^{h(x)/T_{k}}} {\int _{\varTheta }e^{h(x)/T_{k}}\nu (dx)}\) based on the sampled solutions \(X_{k}^{1},\ldots,X_{k}^{N}\).

In light of Eq. (12.5), the mean parameter function \(m(\varphi _{k+1})\) corresponding to the new parameter \(\varphi _{k+1}\) obtained at Step 2 of MARS can be viewed as an iterate generated by the gradient recursion

$$\displaystyle{ m(\varphi _{k+1}) = m(\varphi _{k}) -\alpha _{k}\nabla _{\varphi }\mathcal{D}(\bar{g}_{k+1},f_{\varphi })\vert _{\varphi =\varphi _{k}}. }$$
(12.10)

By the properties of NEFs, the gradient in (12.10) can be expressed in terms of a true gradient term involving the Boltzmann distribution g k+1 and an error term due to random sampling, leading to a Robbins–Monro type stochastic approximation algorithm

$$\displaystyle\begin{array}{rcl} m(\varphi _{k+1})& =& m(\varphi _{k}) -\alpha _{k}\Big(m(\varphi _{k}) -\mathsf{E}_{g_{k+1}}[\varGamma (X)] + \mathsf{E}_{g_{k+1}}[\varGamma (X)] -\mathsf{E}_{\bar{g}_{k+1}}[\varGamma (X)]\Big) \\ & & = m(\varphi _{k}) -\alpha _{k}\nabla _{\varphi }\mathcal{D}(g_{k+1},f_{\varphi })\vert _{\varphi =\varphi _{k}} -\alpha _{k}\big(\mathsf{E}_{g_{k+1}}[\varGamma (X)] -\mathsf{E}_{\bar{g}_{k+1}}[\varGamma (X)]\big).{}\end{array}$$
(12.11)

Note that (12.11) generalizes a typical stochastic approximation recursion in that the function \(\mathcal{D}(g_{k+1},f_{\varphi })\) may change shape with k. This time-varying feature of MARS actually turns out to be a desirable property, because the idealized sequence of Boltzmann distributions {g k } converges to a limiting distribution g as k goes to infinity. This will in turn imply the convergence of the sequence of the optimal solutions \(\{\varphi _{k}\}\) to a global optimizer \(\varphi ^{{\ast}}\). Under some appropriate conditions on the algorithm input parameters, the following convergence result has been obtained in [21]:

$$\displaystyle{\lim _{k\rightarrow \infty }m(\varphi _{k}) =\varGamma (x^{{\ast}})\ \ w.p.1.}$$

In addition, for a polynomially increasing sample size \(N = ak^{\beta }\) and a gain sequence of the form \(\alpha _{k} = c/k^{\alpha }\) for constants a > 0, c > 0, \(\alpha \in (\frac{1} {2},1)\), and β > α, an asymptotic normality result of the following form is also given in [21]:

$$\displaystyle{k^{\frac{\alpha +\beta } {2} }\big(m(\varphi _{k}) -\varGamma (x^{{\ast}})\big)\mathop{\longrightarrow}\limits_{}^{\ d\ }\mathcal{N}(0,\varSigma )\ \ \mbox{ as $k \rightarrow \infty $},}$$

where Σ is a positive definite covariance matrix.

12.4.2 Simulation Optimization

The simulation optimization setting, where H(x, ξ) is obtained in a simulation replication, requires an additional simulation allocation rule {M k }, which allocates M k simulation observations to each of the N candidate solutions generated at the kth iteration. Thus, in constructing gradient estimators, if the true performance at a sampled solution \(X_{k}^{i}\) is replaced by the sample average

$$\displaystyle{\overline{H}_{k}(X_{k}^{i}) = \frac{1} {M_{k}}\sum _{j=1}^{M_{k} }H(X_{k}^{i},\xi _{ j}),}$$

then an estimator of the true gradient \(L(\tilde{\eta }_{k})\) in (12.8) will take the form

$$\displaystyle{\overline{L}_{N}(\tilde{\eta }_{k}) = \frac{N^{-1}\sum _{i=1}^{N}S(\overline{H}_{k}(X_{k}^{i}))G(X_{k}^{i},\tilde{\eta }_{k})} {N^{-1}\sum _{i=1}^{N}S(\overline{H}_{k}(X_{k}^{i}))}.}$$

Consequently, the gradient iteration (12.9) can be carried out at each step by replacing the true performance at a sampled solution \(X_{k}^{i}\) by its sample average approximation \(\overline{H}_{k}(X_{k}^{i})\), leading to a recursion of the form

$$\displaystyle{ \tilde{\eta }_{k+1} =\tilde{\eta } _{k} -\alpha _{k}\tilde{L}_{N}(\tilde{\eta }_{k}) +\alpha _{k}\big(\tilde{L}_{N}(\tilde{\eta }_{k}) -\overline{L}_{N}(\tilde{\eta }_{k})\big). }$$
(12.12)

Under some appropriate conditions on the allocation rule {M k }, the expectation (conditional on the current sampled solutions) of the error term \(\tilde{L}_{N}(\tilde{\eta }_{k}) -\overline{L}_{N}(\tilde{\eta }_{k})\) goes to 0 as k → ∞. Thus by treating the simulation noise \(\tilde{L}_{N}(\tilde{\eta }_{k}) -\overline{L}_{N}(\tilde{\eta }_{k})\) as a vanishing bias term, the (possibly local) convergence and convergence rate analysis of recursion (12.12) can be studied along the same lines as in the deterministic optimization case.

12.4.2.1 Application of MARS to Finite-Horizon Markov Decision Processes

To illustrate the adaptation of the stochastic approximation framework to the simulation optimization setting, we modify and extend the MARS algorithm to a stochastic setting and present a simulation-based algorithm called approximate stochastic annealing (ASA) for solving finite-horizon Markov decision processes (MDPs) [16]. The idea is to interpret an MDP as a stochastic optimization problem on the (randomized) policy space (where candidate solutions are policies) and then use ASA as a specific optimization strategy to directly search the policy space to find good policies.

Consider a discrete-time finite \(\mathcal{H}\)-horizon stochastic system \(x_{t+1} = f(x_{t},a_{t},w_{t})\) for \(t = 0,1,\ldots,\mathcal{H}- 1\), where x t represents the system state at time t taking values from a finite state space \(\mathcal{X}\), a t is the control applied at time t chosen from a finite action set \(\mathcal{A}\), {w t } is a sequence of random vectors representing the stochastic uncertainty of the system, and f is the next-state transition function. Let R t (x t , a t , w t ) be the one-stage reward for action a t taken in state x t at time t. Define Π as the set of non-stationary deterministic Markovian policies \(\pi =\{\pi _{t},\ t = 0,\ldots,\mathcal{H}- 1\}\), where each \(\pi _{t}: \mathcal{X} \rightarrow \mathcal{A}\) is a function that specifies the action to be applied at time t for each \(x \in \mathcal{X}\). For an initial state x 0 = x, the expected total reward (value function) associated with a policy π is given by

$$\displaystyle{ V ^{\pi }(x):= \mathsf{E}\Big[\sum _{t=0}^{\mathcal{H}-1}R_{ t}(x_{t},\pi _{t}(x_{t}),w_{t})\big\vert x_{0} = x\Big]. }$$
(12.13)

The objective is to find an optimal policy π  ∈ Π that maximizes the expected total reward for a given state x, i.e.,

$$\displaystyle{ V ^{\pi ^{{\ast}} }(x) =\max _{\pi \in \varPi }V ^{\pi }(x). }$$
(12.14)

At each iteration, ASA searches for improved policies by sampling from a probability distribution function ϕ(π, q) over the policy space Π, where q is a parameter vector taking values from some parameter space. The distribution is then modified using a Boltzmann selection scheme based on the simulated/estimated value functions of the sampled policies. One simple way to specify the parameterized distribution ϕ(π, q) is to use an \(\vert \mathcal{X}\vert \)-by-\(\vert \mathcal{A}\vert \)-by-\(\mathcal{H}\) stochastic matrix q k , whose (i, j, t)th entry q k (i, j, t) specifies the probability that the action a j is applied to state x i at time t. Such a stochastic matrix q k gives rise to a probability mass function over Π:

$$\displaystyle{ \phi (\pi,q_{k}):=\prod _{ t=0}^{\mathcal{H}-1}\prod _{ i=1}^{\vert \mathcal{X}\vert }\prod _{ j=1}^{\vert \mathcal{A}\vert }\big[q_{ k}(i,j,t)\big]^{I\{\pi \in \varPi _{i,j}(t)\}}\ \ \forall \,\pi \in \varPi, }$$
(12.15)

where I{⋅ } is the indicator function and \(\varPi _{i,j}(t):=\{\pi:\pi _{t}(x_{i}) = a_{j}\}\) denotes the set of deterministic policies that assign action a j to state x i at time t. Thus, as in the MARS algorithm, the goal is to iteratively update the entries q k (i, j, t) so that ϕ(π, q k ) will be a close approximation to the desired Boltzmann distribution

$$\displaystyle{g_{k+1}(\pi ) = \frac{e^{V ^{\pi }/T_{ k}}} {\sum _{\pi \in \varPi }e^{V ^{\pi }/T_{k} }}.}$$

This results in the following adaptive policy search procedure.

Approximate Stochastic Annealing (ASA) Algorithm for Solving MDPs

Step 0.:

Select a sequence of temperature parameters {T k } and a gain sequence {α k }.

Step 1.:

Given q k , sample N policies \(\varLambda _{k}:=\{\pi _{ k}^{1},\ldots,\pi _{k}^{N}\}\) from ϕ(π, q k ). 

Step 2.:

For each π ∈ Λ k , perform M k independent simulation replication runs and let \(\bar{V }_{k}^{\pi } = \frac{1} {M_{k}}\sum _{l=1}^{M_{k}}V _{ l}^{\pi }\), where \(V _{l}^{\pi }\) is an estimate of the value function V π obtained in the lth replication run.

Step 3.:

Update the q matrix

$$\displaystyle{q_{k+1} =\mathop{ \mathrm{ arg\,min}}_{q}\mathcal{D}\big(\hat{g}_{k+1},\phi (\cdot,q)\big).}$$

Set k ← k + 1 and reiterate from Step 1.

Note that at Step 3, the KL-divergence is with respect to \(\hat{g}_{k+1}(\pi ) =\alpha _{k}\tilde{g}_{k+1}(\pi ) + (1 -\alpha _{k})\phi (\pi,q_{k})\), where \(\tilde{g}_{k+1}\) is an empirical estimate of the true Boltzmann distribution g k+1 based on the sampled policies in Λ k and value function estimates \(V _{l}^{\pi }\).

Since the parameterized distribution ϕ(π, q k ) given in (12.15) belongs to NEFs, it is not difficult to show that the entries of the q k matrix updated at Step 3 satisfy the following recursion:

$$\displaystyle\begin{array}{rcl} & & q_{k+1}(i,j,t) - q_{k}(i,j,t) =\alpha _{k}\Big[\mathsf{E}_{\tilde{g}_{k+1}}[I\{\pi \in \varPi _{i,j}(t)\}] - q_{k}(i,j,t)\Big] \\ & & \quad \qquad \qquad =\alpha _{k}\Big[\mathsf{E}_{g_{k+1}}[I\{\pi \in \varPi _{i,j}(t)\}] - q_{k}(i,j,t)\Big] \\ & & \qquad \qquad \qquad \qquad -\alpha _{k}\Big[\mathsf{E}_{g_{k+1}}[I\{\pi \in \varPi _{i,j}(t)\}] -\mathsf{E}_{\tilde{g}_{k+1}}[I\{\pi \in \varPi _{i,j}(t)\}]\Big]. {}\end{array}$$
(12.16)

Since g k+1 assigns more weight to policies with better performance, the first term on the right-hand-side of the second equality implies that the entries of q k are updated in a direction that “pursues” the optimal policy π , whereas the second term can be viewed as a noise term caused by the approximation error between \(\tilde{g}_{k+1}\) and g k+1. Thus the convergence analysis of ASA essentially boils down to the issue of inspecting whether the Boltzmann distribution g k+1 can be closely approximated by its empirical estimate \(\tilde{g}_{k+1}\). In particular, under some mild conditions on the algorithm input parameters, the following result is obtained in [16]:

$$\displaystyle{q_{k}(i,j,t) \rightarrow I\{\pi ^{{\ast}}\in \varPi _{ i,j}(t)\}\ \ \forall i,j,t\ \ \mbox{ as $k \rightarrow \infty $ w.p.1,}}$$

which indicates that the sequence of stochastic matrices q k generated by the algorithm will converge to a limiting matrix assigning unit probability mass to the optimal policy π .
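
The following self-contained sketch illustrates ASA on a deliberately tiny MDP with two states, two actions, and horizon three (the reward and transition structure are assumed for illustration): policies are sampled from the product-form distribution (12.15), value functions are estimated from M k replications, and the q matrix is updated as in (12.16). The annealing schedule, gain sequence, simulation allocation rule, and the use of self-normalized importance weights to form the empirical Boltzmann estimate \(\tilde{g}_{k+1}\) are all illustrative assumptions.

```python
import numpy as np

nS, nA, Hor = 2, 2, 3                     # |X|, |A|, horizon of the toy MDP (assumed)

def simulate(policy, rng):
    """One replication of the total reward of a deterministic policy of shape (Hor, nS)."""
    x, total = 0, 0.0
    for t in range(Hor):
        a = policy[t, x]
        total += (1.0 if a == x else 0.0) + 0.1 * rng.normal()   # noisy one-stage reward
        x = a if rng.random() < 0.8 else 1 - a                   # next-state transition
    return total

def asa(iters=200, N=20, M0=2, seed=0):
    rng = np.random.default_rng(seed)
    q = np.full((Hor, nS, nA), 1.0 / nA)  # q[t, i, j] = q_k(i, j, t): P(a_j in x_i at time t)
    for k in range(iters):
        T_k = 10.0 / np.log(k + 2)        # annealing schedule (assumed)
        alpha_k = 1.0 / (k + 10) ** 0.6   # gain sequence (assumed)
        M_k = M0 + k // 10                # simulation allocation rule (assumed)
        # Step 1: sample N policies from phi(., q_k), cf. (12.15)
        policies = np.stack([np.array([[rng.choice(nA, p=q[t, i]) for i in range(nS)]
                                       for t in range(Hor)]) for _ in range(N)])
        # Step 2: estimate each sampled policy's value by M_k independent replications
        V_bar = np.array([np.mean([simulate(p, rng) for _ in range(M_k)]) for p in policies])
        # empirical Boltzmann estimate g_tilde via self-normalized importance weights
        lik = np.array([np.prod([q[t, i, p[t, i]] for t in range(Hor) for i in range(nS)])
                        for p in policies])
        w = np.exp((V_bar - V_bar.max()) / T_k) / lik
        w = w / w.sum()
        # Step 3: q_{k+1} = (1 - alpha_k) q_k + alpha_k E_gtilde[ I{pi in Pi_ij(t)} ]
        marginal = np.zeros_like(q)
        for wp, p in zip(w, policies):
            for t in range(Hor):
                for i in range(nS):
                    marginal[t, i, p[t, i]] += wp
        q = (1 - alpha_k) * q + alpha_k * marginal
    return q

print(np.round(asa(), 2))   # mass should concentrate on the policy pi_t(x) = x
```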

12.5 A Stochastic Averaging Approach

As we have seen in previous sections, convergence analysis of model-based algorithms typically requires a sample size N that increases polynomially with the number of algorithm iterations. In practice, this translates to using a per-iteration computational effort that grows without bound as the number of iterations increases, which may have a negative impact on the algorithm’s practical performance, especially in the setting where simulation/function evaluations are expensive.

This efficiency issue has been tackled in [17], where the basic idea is to maintain a population of probability distribution models (rather than just a single model as in a typical model-based algorithm) at each iteration and then adaptively allocate a given computing budget among different models in order to maximize the expected performance of the algorithm. In this section, we present an approach that aims to improve the sampling efficiency of model-based algorithms from a different perspective, focusing primarily on reducing the number of candidate solutions generated per iteration. This is carried out through embedding a stochastic averaging procedure within model-based algorithms to make more efficient use of the past sampling information. The material in this section is based on [23, 24].

For simplicity, we consider the general reference distribution introduced in Sect. 12.3:

$$\displaystyle{g_{k}(x) = \frac{S(h(x))g_{k-1}(x)} {\mathsf{E}_{g_{k-1}}[S(h(X))]}.}$$

By expanding the above recursion, we can write g k in terms of the initial distribution g 0 as

$$\displaystyle{ g_{k}(x) = \frac{S_{k}(h(x))g_{0}(x)} {\mathsf{E}_{g_{0}}[S_{k}(h(X))]}, }$$
(12.17)

where S k is some appropriate iteration-varying function that depends on S. Substituting (12.17) into the minimization problem \(\min _{\varphi }\mathcal{D}(g_{k+1},f_{\varphi })\) and dropping terms that are constants with respect to \(\varphi\), it can be seen that the parameter \(\varphi _{k+1}\) obtained at Step 2 of the MRAS method of Sect. 12.3 can be equivalently obtained by solving the following optimization problem

$$\displaystyle{ \varphi _{k+1} =\mathop{ \mathrm{ arg\,max}}_{\varphi \in \varPhi }\bigg(Q_{k+1}(\varphi ):=\int _{x\in \varTheta }S_{k+1}(h(x))\ln f_{\varphi }(x)dx\bigg). }$$
(12.18)

As discussed previously, in model-based algorithms, the integral involved in the Q-function \(Q_{k+1}(\varphi )\) is estimated by generating N i.i.d. candidate solutions \(X_{k}^{1},\ldots,X_{k}^{N}\) from \(f_{\varphi _{k}}\), and then replacing \(Q_{k+1}(\varphi )\) by its sample average approximation

$$\displaystyle{\bar{Q}_{k+1}(\varphi ):= \frac{1} {N}\sum _{i=1}^{N}\frac{S_{k+1}(h(X_{k}^{i}))} {f_{\varphi _{k}}(X_{k}^{i})} \ln f_{\varphi }(X_{k}^{i}).}$$

Although \(\bar{Q}_{k+1}(\varphi )\) is an unbiased estimator of \(Q_{k+1}(\varphi )\), the corresponding optimization step will lead to an estimator of \(\varphi _{k+1}\) that is biased for any finite sample size N, because the optimal solution to (12.18) involves a ratio of integrals/expectations. Consequently, common implementations of these algorithms either require hundreds or even thousands of candidate solutions to be generated per iteration [6, 35], or require the use of a sample size N that increases at least polynomially with k in order to reduce the ratio bias effect [13, 18, 21, 22].

This bias issue has been addressed in [23, 24] by replacing the sample average approximation \(\bar{Q}_{k}(\varphi )\) with the stochastic averaging procedure

$$\displaystyle{ \hat{Q}_{k+1}(\varphi ) = (1 -\beta _{k})\hat{Q}_{k}(\varphi ) +\beta _{k} \frac{1} {N}\sum _{i=1}^{N}\frac{S_{k+1}(h(X_{k}^{i}))} {f_{\varphi _{k}}(X_{k}^{i})} \ln f_{\varphi }(X_{k}^{i}), }$$
(12.19)

with \(\hat{Q}_{1}(\varphi ):= \frac{1} {N}\sum _{i=1}^{N}\big(S_{ 1}(h(X_{0}^{i}))/f_{\varphi _{ 0}}(X_{0}^{i})\big)\ln f_{\varphi }(X_{ 0}^{i})\), where β k is a step size constant satisfying \(\beta _{k} \in (0,1]\ \forall \,k\). Note that this procedure incrementally updates the current estimate of the Q-function as new sampling information becomes available at each iteration. In addition, due to the recursive nature of (12.19), all candidate solutions generated in the previous iterations contribute to the estimation of the Q-function \(Q_{k+1}(\varphi )\). Consequently, it is reasonable to expect that the number of samples per iteration N can be significantly reduced or even held at a small constant value.

It is interesting to note that when NEF is used, \(\hat{Q}_{k+1}(\varphi )\) can be expressed as a linear combination of the parameter vector \(\varphi\) and the function \(K(\varphi )\):

$$\displaystyle{\hat{Q}_{k+1}(\varphi ) =\varphi ^{T}\mathcal{S}_{ k+1} - K(\varphi )\mathcal{R}_{k+1},}$$

where the quantities \(\mathcal{S}_{k}\) and \(\mathcal{R}_{k}\) can be computed via the respective recursions

$$\displaystyle\begin{array}{rcl} \mathcal{S}_{k+1}& =& \mathcal{S}_{k} +\beta _{k}\bigg( \frac{1} {N}\sum _{i=1}^{N}\frac{S_{k+1}(h(X_{k}^{i}))} {f_{\varphi _{k}}(X_{k}^{i})} \varGamma (X_{k}^{i}) -\mathcal{S}_{ k}\bigg), {}\\ \mathcal{R}_{k+1}& =& \mathcal{R}_{k} +\beta _{k}\bigg( \frac{1} {N}\sum _{i=1}^{N}\frac{S_{k+1}(h(X_{k}^{i}))} {f_{\varphi _{k}}(X_{k}^{i})} -\mathcal{R}_{k}\bigg), {}\\ \end{array}$$

with \(\mathcal{S}_{1}:= \frac{1} {N}\sum _{i=1}^{N}S_{ 1}(h(X_{0}^{i}))/f_{\varphi _{ 0}}(X_{0}^{i})\varGamma (X_{ 0}^{i})\) and \(\mathcal{R}_{1}:= \frac{1} {N}\sum _{i=1}^{N}S_{ 1}(h(X_{0}^{i}))/f_{\varphi _{ 0}}(X_{0}^{i}).\) Thus, by substituting \(\hat{Q}_{k+1}(\varphi )\) for \(Q_{k+1}(\varphi )\) in (12.18), we have the following optimization problem:

$$\displaystyle\begin{array}{rcl} \varphi _{k+1} =\mathop{ \mathrm{ arg\,max}}_{\varphi \in \varPhi }\Big(\varphi ^{T}\mathcal{S}_{ k+1} - K(\varphi )\mathcal{R}_{k+1}\Big),& & {}\\ \end{array}$$

whose unique closed-form solution is given by

$$\displaystyle{ m(\varphi _{k+1}) = \frac{\mathcal{S}_{k+1}} {\mathcal{R}_{k+1}}\ \mbox{ or equivalently }\ \varphi _{k+1} = m^{-1}\Big( \frac{\mathcal{S}_{k+1}} {\mathcal{R}_{k+1}}\Big). }$$

The above stochastic averaging idea has been combined with the MARS algorithm of Sect. 12.4, leading to a two-time-scale stochastic approximation type of algorithm called MARS with stochastic averaging (MARS-SA).

MARS with Stochastic Averaging (MARS-SA)

Step 0.:

Select temperature parameters {T k } and gain sequences {α k } and {β k }.

Step 1.:

Generate N i.i.d. candidate solutions \(\varLambda _{k}:=\{ X_{k}^{1},\ldots,X_{k}^{N}\}\) from \(f_{\varphi _{k}}(x)\).

Step 2.:

Update \(\mathcal{S}_{k+1}\) and \(\mathcal{R}_{k+1}\) according to the recursions:

$$\displaystyle\begin{array}{rcl} \mathcal{S}_{k+1}& =& \mathcal{S}_{k} +\beta _{k}\bigg( \frac{1} {N}\sum _{x\in \varLambda _{k}}\frac{e^{h(x)/T_{k+1}}} {f_{\varphi _{k}}(x)} \varGamma (x) -\mathcal{S}_{k}\bigg), {}\\ \mathcal{R}_{k+1}& =& \mathcal{R}_{k} +\beta _{k}\bigg( \frac{1} {N}\sum _{x\in \varLambda _{k}}\frac{e^{h(x)/T_{k+1}}} {f_{\varphi _{k}}(x)} -\mathcal{R}_{k}\bigg). {}\\ \end{array}$$
Step 3.:

Compute a new parameter \(\varphi _{k+1}\) as

$$\displaystyle{m(\varphi _{k+1}) =\alpha _{k} \frac{\mathcal{S}_{k+1}} {\mathcal{R}_{k+1}} + (1 -\alpha _{k})m(\varphi _{k}).}$$

Set k ← k + 1 and reiterate from Step 1.

By exploiting the connection of MARS-SA to the stochastic approximation method, it is shown in [23, 24] that the algorithm converges globally even when the per-iteration sample size N is held at a small constant value. In addition, preliminary empirical results reported in [24] indicate that the new algorithm can be more efficient (in terms of the number of performance evaluations) than the original MARS algorithm.
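
To complement these observations, a minimal one-dimensional sketch of MARS-SA with a univariate normal family is given below; the objective function, the temperature, gain, and averaging sequences, and the small constant per-iteration sample size N = 5 are assumptions chosen purely for illustration.

```python
import numpy as np

def h(x):
    return -(x - 2.0)**2                      # toy objective; optimizer x* = 2

def mars_sa(iters=500, N=5, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma2 = 0.0, 25.0                    # parameters of the initial f_{phi_0}
    S_vec, R = np.zeros(2), 0.0               # the running quantities S_k and R_k
    for k in range(iters):
        T = 5.0 / np.log(k + 2)               # temperature schedule {T_k} (assumed)
        alpha = 1.0 / (k + 20) ** 0.8         # gain sequence {alpha_k} (assumed)
        beta = 1.0 / (k + 1) ** 0.6           # averaging sequence {beta_k} (assumed)
        # Step 1: generate N candidate solutions from f_{phi_k}
        X = rng.normal(mu, np.sqrt(sigma2), size=N)
        Gamma = np.stack([X, X**2], axis=1)   # sufficient statistic (x, x^2)
        f = np.exp(-(X - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        w = np.exp(h(X) / T) / f              # Boltzmann weight over the sampling density
        # Step 2: stochastic-averaging recursions for S_{k+1} and R_{k+1}
        S_vec = S_vec + beta * ((w[:, None] * Gamma).mean(axis=0) - S_vec)
        R = R + beta * (w.mean() - R)
        # Step 3: smoothed update of the mean parameter m(phi_{k+1})
        eta = alpha * (S_vec / R) + (1 - alpha) * np.array([mu, sigma2 + mu**2])
        mu, sigma2 = eta[0], max(eta[1] - eta[0]**2, 1e-12)
    return mu, sigma2

print(mars_sa())   # mu should approach 2 even with N held at a small constant value
```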

12.6 Conclusions and Open Research Questions

In this chapter, we have provided an overview of three model-based optimization methods and discussed their extensions to simulation optimization. In particular, the MRAS method discussed in Sect. 12.3 offers a general framework to design and implement model-based algorithms, whereas the stochastic approximation framework presented in Sect. 12.4 provides a systematic approach to analyze the convergence properties of these algorithms. In addition, we have also outlined in Sect. 12.5 a stochastic averaging procedure that addresses the estimator bias issue in model-based algorithms and aims to make these algorithms computationally more efficient. These methods are illustrated through a number of exemplary algorithms that are both provably convergent and empirically promising.

There are many challenging research issues that remain to be addressed. For example, the theoretical convergence and empirical performance of model-based algorithms are greatly influenced by the choices of reference distributions. So a natural research question is how the reference distributions should be chosen for specific problems. Moreover, since existing convergence results are all asymptotic in nature, a theoretical issue is to study whether finite-time performance bounds (e.g., similar to those in stochastic adaptive search [39]) can also be developed for these algorithms.

Model-based algorithms generally do not make use of problem structure, whereas in a continuous-variable simulation optimization setting, there may be additional information available (e.g., stochastic gradient estimates, Lipschitz continuity) from problem knowledge. It is well-known that effective use of structure may dramatically improve the solution efficiency. Therefore, another interesting research direction is to investigate how to incorporate problem structure information into model-based algorithms, as well as to identify classes of problems for which this can be done in a systematic manner.

The extension of model-based algorithms to simulation optimization is carried out in a relatively straightforward manner by introducing an additional simulation allocation sequence {M k } to obtain increasingly precise performance estimates as the search proceeds. This motivates the design of new model-based algorithms that do not rely on expending additional simulation effort on performance estimation (M k  = 1 for all k), e.g., in a way that the simulation noise will act like martingale difference noise in stochastic approximation and automatically average out over a large number of iterations. This is yet another avenue of research that merits further investigation.