
6.1 Introduction

Stochastic approximation (SA) is a recursive algorithm that can be viewed as the stochastic counterpart to steepest descent in deterministic optimization. SA was introduced by Robbins and Monro in 1951 [32] to solve noisy root-finding problems and was later applied to the setting of stochastic optimization by solving for the zero of the gradient. The gradient-free setting was addressed by Kiefer and Wolfowitz in 1952 [24]. SA is currently one of the most widely applicable and most useful methods for simulation optimization.

Consider the stochastic optimization problem

$$\displaystyle\begin{array}{rcl} \min _{x\in \varTheta }f(x),& &{}\end{array}$$
(6.1)

where \(f(x) = \mathsf{E}[Y (x,\xi )]\) is a performance measure, Y (x, ξ) is a sample performance, ξ denotes the stochastic effects, and \(\varTheta \subseteq \mathbb{R}^{d}\) is a continuous parameter space. In this case, the objective is to find a sequence {x n } that converges to a unique (local) optimum

$$\displaystyle\begin{array}{rcl} x^{{\ast}} =\arg \min _{ x\in \varTheta }f(x),& &{}\end{array}$$
(6.2)

by using the recursion

$$\displaystyle\begin{array}{rcl} x_{n+1} =\varPi _{\varTheta }\left (x_{n} - a_{n}\hat{\nabla }f(x_{n})\right ),& &{}\end{array}$$
(6.3)

where Π Θ (x) is a projection of x back into the feasible region Θ if \(x\notin \varTheta\), a n  > 0 is the step size or gain size, \(\hat{\nabla }f(x_{n})\) is an estimate of the gradient ∇f(x n ), and the output is the iterate x N at the stopping time N, which we denote by \(x_{N}^{{\ast}}\). The projection operator is only required in the constrained setting. Moreover, the minimization problem in (6.1) and (6.2) could easily be changed to maximization by changing the sign of a n in (6.3). The two classical methods, Robbins–Monro (RM) and Kiefer–Wolfowitz (KW), estimate ∇f(x n ) using unbiased direct gradient estimates and finite difference gradient estimates, respectively. Under certain conditions, RM and KW have the respective asymptotic convergence rates O(n −1∕2) and O(n −1∕3).
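
To make the recursion concrete, the following minimal Python sketch implements (6.3) for a generic problem. The helper names grad_est, project, and step are hypothetical placeholders for a user-supplied gradient estimator, projection operator Π Θ , and gain sequence {a n }; the quadratic example in the comments is an illustrative assumption, not part of any standard specification.

```python
import numpy as np

def sa_minimize(grad_est, project, step, x1, N):
    """Projected SA recursion (6.3): x_{n+1} = Pi_Theta(x_n - a_n * ghat(x_n))."""
    x = np.asarray(x1, dtype=float)
    for n in range(1, N):
        x = project(x - step(n) * grad_est(x))
    return x  # output x_N^*, here taken as the last iterate

# Example: minimize f(x) = E[||x - xi||^2]/2 over Theta = [-5, 5]^2, xi ~ N(1, I).
rng = np.random.default_rng(0)
x_out = sa_minimize(
    grad_est=lambda x: x - rng.normal(1.0, 1.0, size=x.shape),  # unbiased: E[ghat] = x - 1
    project=lambda x: np.clip(x, -5.0, 5.0),                    # Pi_Theta for a box
    step=lambda n: 1.0 / n,                                     # a_n = 1/n
    x1=np.zeros(2), N=10_000)
```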

Advances in SA have included the development of new algorithms, modifications to existing ones, and new asymptotic and finite-time theory. Asymptotic convergence properties of KW, RM, and their variations have been a major research focus (cf. [12, 14, 15, 28, 31, 41]). The original RM and KW algorithms apply to one-dimensional problems, but they were later extended to the multidimensional case [2]. Furthermore, the earlier conditions used to prove convergence for RM and KW were relaxed to obtain almost sure convergence [2]. The estimate x n in (6.3) was shown to be asymptotically normal [15] with the optimal convergence rate of O(n −1∕2) [8]. More recently, researchers have placed greater emphasis on finite-time theory, as well as on error bounds on the difference between the objective value at the current estimate and the optimal objective value (i.e., \(\mathsf{E}[f(x_{N}^{{\ast}}) - f(x^{{\ast}})]\)), which the next chapter treats in more detail.

Although recursion (6.3) is quite simple, the choice of step-size sequence {a n }, gradient estimator \(\hat{\nabla }f(x_{n})\), projection operator Π Θ , and output \(x_{N}^{{\ast}}\) each has a significant impact on the performance of the algorithm.

We first discuss the influence of the step-size sequence {a n } on finite-time performance. It is widely known that the practical performance of the classical RM and KW algorithms is highly dependent on the choice of {a n }, and that both algorithms often perform poorly without tuning. The algorithm can experience a long oscillatory period if the gain sequence {a n } is “too large,” where the iterates jump back and forth without approaching the optimum \(x^{{\ast}}\), as seen in Fig. 6.1 (left graph), or a degraded convergence rate if {a n } is “too small” relative to the magnitude of the gradient, where the iterates barely move, as seen in Fig. 6.1 (right graph) (only x 1 is labeled since the other iterates are in essentially the same position).

Fig. 6.1

Sensitivity of SA to step size a n when a n is “too large” relative to the gradient (left graph) and when a n is “too small” relative to the gradient (right graph)

One approach to tackle this sensitivity is to develop robust step-size sequences, i.e., adaptive step-size rules. The earliest attempt at adaptive step sizes was Kesten’s rule, which can be applied to both RM and KW [23]. This step size decreases only when there is a directional change in the iterates, i.e., \((x_{n+1} - x_{n})(x_{n} - x_{n-1}) < 0\). The idea behind this adaptive step size is that, if the iterates move in the same direction, there is reason to believe they are not in close proximity to the optimum, so the momentum should not be decreased. Later, this idea was extended by increasing the step size, as opposed to keeping it constant, when the consecutive errors in the estimates are of the same sign, to increase the speed of convergence [37]. A recent attempt, called scaled-and-shifted Kiefer–Wolfowitz (SSKW) and described in more detail in Sect. 6.4.1, adaptively adjusts the step-size sequence {a n } finitely many times during the course of a modified version of KW [4]. The rationale behind the procedure is to increase the gain size so the iterates are able to move from one endpoint to the other in the one-dimensional case (the ideas are analogous in the multidimensional case [3]), which ensures the step sizes are large enough to make noticeable progress towards the optimum in finite time. If the step sizes {a n } are too large, then they are decreased at a faster pace during the recourse stage. Another method used to select an adaptive step size is to generate an approximation of the inverse of the Hessian, which is the stochastic analogue of the deterministic Newton–Raphson method [42]. According to Yakowitz et al. [45], “the optimal choice [of step-size sequence] involves the Hessian of the risk [objective] function, which is typically unknown and hard to estimate.” Hessians have been estimated using a set of finite difference gradient approximations [16], heuristics based on quasi-Newton methods [22], and finite difference approximations using gradient estimates [33, 42]. In short, the choice of {a n } has a significant impact on the finite-time performance of the algorithm, an impact that is quite difficult to characterize theoretically.

Another approach to reduce the sensitivity of the estimated optimum to {a n } is to modify the output so that the algorithm puts less emphasis on the last iterate. The underlying idea behind methods of this type is to take longer steps and incorporate a subset of the iterates into the output to decrease the reliance on the last iterate. Iterate averaging, which takes the average of all iterates as the output instead of the final iterate, was the earliest proposal [31, 34]. This algorithm is easy to implement, is “robust” in the sense of being less sensitive to the initial step-size choice, and exhibits an O(n −1∕2) asymptotic convergence rate under appropriate conditions. A generalization of iterate averaging, called the robust stochastic approximation (RSA) algorithm [30], uses the step-size sequence {a n } to weight the iterates; it is described in more detail in Sect. 6.4. Another generalization introduces a proximity function into the objective function, which acts as a regularization term to prevent the next iterate x n+1 from being too far from x n . An example from this class of algorithms, the accelerated stochastic approximation (AC-SA) algorithm [20], is detailed in Sect. 6.4.

Asymptotic theory for SA initially considered only functions satisfying specific global conditions; however, the requirements need only hold on a compact set containing the optimum, so the projection operator is particularly important in the constrained optimization setting. The feasible region Θ must be large enough to increase the likelihood that \(x^{{\ast}} \in \varTheta\), but enlarging the search space may degrade the performance of the algorithm. Adaptively increasing the search space still leads to provable convergence with an appropriate projection operator, such as one that adaptively projects the iterates onto an increasing sequence of compact sets [1].

The gradient estimate is also clearly central to any SA algorithm, and thus the subject of stochastic gradients is treated in depth in Chap. 5. The most common gradient estimate is obtained using finite differences, because it only requires performance measures and no additional information from the system. For each dimension, the finite difference gradient estimate requires two performance measures, and if the measurements are highly volatile, then the gradient estimates can be noisy. Furthermore, finite difference estimates become computationally expensive in high dimensions, since the cost grows linearly with the parameter dimension [2]. Simultaneous perturbation stochastic approximation (SPSA) [38] requires only two estimates of the objective function to approximate the gradient, so its computational cost is independent of the dimension of the parameter space. Recently, a new SA algorithm called Secant-Tangents AveRaged stochastic approximation (STAR-SA) has been proposed; it employs a hybrid gradient estimator that combines direct gradient estimates with a symmetric finite difference gradient estimate [5, 6].

The remainder of the chapter is organized as follows. We introduce the classical stochastic approximation methods, Robbins–Monro (RM) and Kiefer–Wolfowitz (KW), in Sect. 6.2. In Sect. 6.3, we present some of the most useful enhancements for simulation optimization: Kesten’s rule, Polyak–Ruppert iterate averaging, varying bounds, and SPSA. We describe four recent developments, scaled-and-shifted Kiefer–Wolfowitz (SSKW), robust stochastic approximation (RSA), accelerated stochastic approximation (AC-SA) for convex and strongly convex functions, and Secant-Tangents AveRaged (STAR) stochastic approximation, in Sect. 6.4. In Sect. 6.5, we present numerical experiments comparing KW-type algorithms (original KW, KW with Kesten’s rule, SSKW), RM-type methods (original RM, RM with iterate averaging, RSA, AC-SA), and single versus mixed gradients (RM, KW, STAR-SA). Finally, we close with concluding remarks in Sect. 6.6.

6.2 Classical Methods

The classical RM and KW algorithms address unconstrained stochastic optimization problems, so we consider the recursive scheme

$$\displaystyle\begin{array}{rcl} x_{n+1} = x_{n} - a_{n}\hat{\nabla }f(x_{n}),& &{}\end{array}$$
(6.4)

which is identical to (6.3) with the exception of the projection operator.

6.2.1 Robbins–Monro (RM) Algorithm

The RM algorithm was introduced to solve the root-finding problem

$$\displaystyle\begin{array}{rcl} M(x) =\alpha & & {}\\ \end{array}$$

for \(x \in \mathbb{R}\), where M(x) is a monotone function and \(\alpha \in \mathbb{R}\). However, it was later applied to a specific case of root-finding in the stochastic optimization setting, where the objective is to optimize a stochastic objective function f(x) by setting M(x) = ∇f(x) and α = 0. RM solves this problem iteratively as in (6.4) by replacing \(\hat{\nabla }f(x_{n})\) with an unbiased estimator, and the output is taken as the last iterate, \(x_{N}^{{\ast}}\), where N is the stopping time. However, in RM, the direct gradient measurements are still only approximations to the actual gradient because of the presence of noise (\(\hat{\nabla }f(x_{n}) = \nabla f(x_{n}) +\epsilon _{n}\), where ε n is noise with zero mean). Given appropriate parameters, this algorithm converges asymptotically at a rate of O(n −1∕2) [35].

Theorem 6.1 (Theorem 2 [32]).

Assume ∇f(x) has a unique root \(x^{{\ast}}\) and suppose \(\hat{\nabla }f(x)\) is an unbiased gradient estimator, i.e., \(\nabla f(x) = \mathsf{E}[\hat{\nabla }f(x)]\) . If the sequence {x n } is generated from  (6.4) and the following conditions hold:

  1. {a n } is a sequence of positive constants such that \(\sum _{n=1}^{\infty }a_{n} = \infty \) and \(\sum _{n=1}^{\infty }a_{n}^{2} < \infty.\)

  2. \(\nabla f(x) \geq 0\ \mathit{for}\ x > x^{{\ast}}\ \mathit{and}\ \nabla f(x) \leq 0\ \mathit{for}\ x < x^{{\ast}}\) .

  3. There exists a positive constant C such that \(\mathsf{P}(\vert \hat{\nabla }f(x)\vert \leq C) = 1\ \forall x.\)

Then \(x_{n}\mathop{ \rightarrow }\limits^{ L^{2}}x^{{\ast}}\) as n →∞, where \(\mathop{\rightarrow }\limits^{ L^{2}}\) denotes mean-squared convergence.

The most well-known conditions are restrictions on the gain sequence {a n }. Generally, a n  → 0 but \(\sum _{n=1}^{\infty }a_{n} = \infty \), which prevents the step size from converging to zero too quickly, so the iterates are able to make progress towards \(x^{{\ast}}\) and do not get stuck at a poor estimate. The usual form is \(a_{n} = \frac{\theta _{a}} {(n+A)^{\alpha }}\), where θ a  > 0,  A ≥ 0 and \(\frac{1} {2} <\alpha \leq 1\), with A = 0 and α = 1 as a commonly used choice. The objective function f is assumed to have a global minimum with a bounded derivative.
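
As an illustration, here is a minimal Python sketch of the unconstrained RM recursion (6.4) with the standard gain \(a_{n} = \theta _{a}/(n+A)^{\alpha }\); the one-dimensional example, its noise distribution, and all parameter values are hypothetical choices for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def rm(grad_sample, x1, N, theta_a=1.0, A=0.0, alpha=1.0):
    """Robbins-Monro with gain a_n = theta_a / (n + A)**alpha (unconstrained)."""
    x = float(x1)
    for n in range(1, N):
        a_n = theta_a / (n + A) ** alpha
        x -= a_n * grad_sample(x)   # unbiased direct gradient estimate
    return x                        # x_N^* = x_N

# Example: f(x) = E[(x - xi)^2]/2 with xi ~ N(2, 1), so grad f(x) = x - 2 and x* = 2.
x_hat = rm(lambda x: x - rng.normal(2.0, 1.0), x1=0.0, N=10_000)
```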

6.2.2 Kiefer–Wolfowitz (KW) Algorithm

The KW stochastic approximation algorithm is referred to as a gradient-free or stochastic zeroth-order method in the following chapter, since it only requires noisy measurements of the function and does not require additional information on the system dynamics or input distributions. The original KW iterative scheme

$$\displaystyle\begin{array}{rcl} x_{n+1} = x_{n} - a_{n}\dfrac{Y (x_{n} + c_{n},\xi _{n}^{+}) - Y (x_{n} - c_{n},\xi _{n}^{-})} {2c_{n}},& &{}\end{array}$$
(6.5)

estimates the gradient using a symmetric finite difference gradient estimate, and under certain conditions, KW can achieve an asymptotic convergence rate of O(n −1∕3). In addition, common random numbers (CRN) can be employed to decrease the variance of estimates, and KW can achieve an asymptotic convergence rate of O(n −1∕2) in certain settings [25].

Theorem 6.2 (Theorem in [24]).

Assume \(f(x) = \mathsf{E}[Y (x,\xi )]\) . If the sequence {x n } is generated from  (6.5) and the following conditions hold:

  1. Let {a n } and {c n } be positive tuning sequences satisfying the conditions

    $$\displaystyle{c_{n} \rightarrow 0,\ \sum _{n=1}^{\infty }a_{n} = \infty,\ \sum _{n=1}^{\infty }a_{n}c_{n} < \infty,\ \sum _{n=1}^{\infty }a_{n}^{2}c_{n}^{-2} < \infty.}$$

  2. f(x) is strictly decreasing for \(x < x^{{\ast}}\) and strictly increasing for \(x > x^{{\ast}}\) .

  3. \(\mathrm{Var}[Y (x,\xi )] < \infty \) and the following regularity conditions hold:

    1) There exist positive constants β and B such that

      $$\displaystyle{\vert x' - x^{{\ast}}\vert + \vert x'' - x^{{\ast}}\vert <\beta \Longrightarrow\vert f(x') - f(x'')\vert < B\vert x' - x''\vert.}$$

    2) There exist positive ρ and R such that

      $$\displaystyle{\vert x' - x''\vert <\rho \Longrightarrow\vert f(x') - f(x'')\vert < R.}$$

    3) For every δ > 0 there exists a positive π(δ) such that

      $$\displaystyle{\vert x - x^{{\ast}}\vert >\delta \Longrightarrow\inf _{ \frac{\delta } {2} >\epsilon >0}\frac{\vert f(x+\epsilon ) - f(x-\epsilon )\vert } {\epsilon } >\pi (\delta ).}$$

Then \(x_{n}\mathop{ \rightarrow }\limits^{ p}x^{{\ast}}\) as n →∞, where \(\mathop{\rightarrow }\limits^{ p}\) denotes convergence in probability.

Condition 1 ensures that the step size a n does not converge to zero too fast, so the iterates do not get stuck at a poor estimate. In addition, the condition restricts the finite difference step size c n from decreasing too quickly, which controls the noise of the gradient estimates. The second condition ensures that there is a global optimum. The first regularity condition requires f(x) to be locally Lipschitz in a neighborhood of \(x^{{\ast}}\); the second one prevents f(x) from changing drastically in the feasible region; and the last one prohibits the function from being very flat outside a neighborhood of \(x^{{\ast}}\), so that the iterates approach the optimum. Although the KW algorithm converges asymptotically, its finite-time performance depends on the choice of the tuning sequences {a n } and {c n }. If the current iterate x n is in a relatively flat region of the function and a n is small, then convergence will be slow. On the other hand, if x n is located in a very steep region of the function and {a n } is large, then the iterates will experience a long oscillation period. If {c n } is too small, the gradient estimates using finite differences can be extremely noisy.

KW has been extended to higher dimensions, and two common gradient estimates considered are symmetric differences and one-sided forward differences, whose ith components are given by

$$\displaystyle\begin{array}{rcl} \hat{\nabla }f_{i}(x_{n}) = \left \{\begin{array}{ll} \dfrac{Y (x_{n} + c_{n}e_{i},\xi _{n,i}^{+}) - Y (x_{n} - c_{n}e_{i},\xi _{n,i}^{-})} {2c_{n}} &\mbox{ symmetric difference}, \\ \dfrac{Y (x_{n} + c_{n}e_{i},\xi _{n,i}^{+}) - Y (x_{n},\xi _{n,i})} {c_{n}} &\mbox{ one-sided forward difference}, \end{array} \right.& & {}\\ & & {}\\ \end{array}$$

where e i denotes the ith unit basis vector in \(\mathbb{R}^{d}\), \(c_{n} \in \mathbb{R}^{+}\), and Y (x, ξ) is an unbiased estimate of f(x). This method perturbs each component of x n (i.e., x n, i for \(i = 1,\mathop{\ldots },d\)) one at a time while holding all others constant and returns a corresponding function value estimate. For instance, symmetric differences require estimates of the two function values \(f(x_{n} + c_{n}e_{i})\) and \(f(x_{n} - c_{n}e_{i})\) for \(i = 1,\mathop{\ldots },d\), whereas forward differences require f(x n ) and \(f(x_{n} + c_{n}e_{i})\) for \(i = 1,\mathop{\ldots },d\); therefore, using symmetric and one-sided forward difference estimates involves 2d and d + 1 simulation replications, respectively. Although the symmetric difference scheme is computationally more expensive, it has the potential to reach an asymptotic convergence rate of O(n −1∕3), compared to O(n −1∕4) for forward differences. For d = 1, the computational cost is identical for the symmetric difference and the one-sided forward difference. Compared with the RM algorithm, however, KW convergence rates are typically inferior, although under certain conditions with CRN (\(\xi _{n,i}^{+} =\xi _{ n,i}^{-}\)), KW algorithms can also achieve the O(n −1∕2) asymptotic convergence rate. For simulation optimization, RM is not always applicable, since it needs additional information that may not be readily available or may be difficult to obtain. For KW, there is the additional task of appropriately choosing the difference sequence {c n }. In general, KW is a simple algorithm to implement for simulation optimization applications, albeit costly in high-dimensional settings.
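
The two estimators above can be written compactly in code. The sketch below is a hypothetical implementation, assuming Y(x) returns one noisy sample of the performance measure at x; it uses 2d samples per gradient for symmetric differences and d + 1 for forward differences, matching the replication counts just discussed.

```python
import numpy as np

def fd_gradient(Y, x, c, scheme="symmetric"):
    """Finite-difference gradient estimate for multidimensional KW."""
    d = len(x)
    g = np.empty(d)
    if scheme == "symmetric":
        for i in range(d):                      # 2d simulation replications
            e = np.zeros(d); e[i] = 1.0
            g[i] = (Y(x + c * e) - Y(x - c * e)) / (2.0 * c)
    else:                                       # one-sided forward difference
        y0 = Y(x)                               # d + 1 replications in total
        for i in range(d):
            e = np.zeros(d); e[i] = 1.0
            g[i] = (Y(x + c * e) - y0) / c
    return g
```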

6.3 Well-Known Variants

In this section, we elaborate on Kesten’s rule, iterate averaging, adaptively varying bounds, and SPSA.

6.3.1 Kesten’s Rule

It is well-known that the classical SA algorithms are extremely sensitive to the step-size sequence {a n }. Therefore, it can be advantageous to consider adaptive step sizes that adjust based on the ongoing performance of the algorithm, in hopes of adapting to the characteristics of the function at the current location of the iterate and to the proximity of the current iterate to the optimum. Kesten’s rule [23] decreases the step size only when there is a directional change in the iterates. The notion behind this adaptive step size is that, if the iterates continue in the same direction, there is reason to believe they have not yet reached the vicinity of the optimum, so the pace should not be decreased, in order to accelerate convergence. If the errors in the estimates change signs, it is an indication either that the step size is too large and the iterates are experiencing long oscillation periods, or that the iterates are in the vicinity of the true optimum; either way, the step size should be reduced, to take more appropriately sized steps or to home in on \(x^{{\ast}}\). The following algorithm is for the one-dimensional case d = 1.

SA Algorithm Using Kesten’s Rule

  • Input. Choose \(x_{1} \in \varTheta,\{a_{n}\}\), Π Θ , and stopping time N.

  • Initialize.

    • Let n = 2 and k = 1.

    • Generate an estimate \(\hat{\nabla }f(x_{1})\) of \(\nabla f(x_{1})\).

    • Compute \(x_{2} =\varPi _{\varTheta }(x_{1} - a_{1}\hat{\nabla }f(x_{1}))\).

  • While n < N,

    • Step 1. Generate an estimate \(\hat{\nabla }f(x_{n})\) of ∇f(x n ).

    • Step 2. Compute \(x_{n+1} =\varPi _{\varTheta }(x_{n} - a_{k}\hat{\nabla }f(x_{n}))\). If \((x_{n+1} - x_{n})(x_{n} - x_{n-1}) < 0\), go to Step 3. Otherwise, go to Step 4.

    • Step 3. Let n = n + 1 and k = k + 1. Go to Step 1.

    • Step 4. Let n = n + 1. Go to Step 1.

  • Output. \(x_{N}^{{\ast}} = x_{N}\).
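
A minimal Python sketch of the listing above follows, assuming grad_est and project are user-supplied and a(k) returns the gain a k ; these names are illustrative placeholders, not standard routines.

```python
def sa_kesten(grad_est, project, a, x1, N):
    """SA with Kesten's rule (d = 1): the step-size index k advances only
    when consecutive increments change sign."""
    k = 1
    x_prev = x1
    x = project(x1 - a(1) * grad_est(x1))       # x_2, computed with a_1
    for _ in range(2, N):
        x_next = project(x - a(k) * grad_est(x))
        if (x_next - x) * (x - x_prev) < 0:     # direction change detected,
            k += 1                              # so use a smaller gain next time
        x_prev, x = x, x_next
    return x                                    # x_N^* = x_N
```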

Kesten’s rule can be applied to both RM and KW and still guarantee convergence in probability, as long as {a n } satisfies condition 1 in Theorems 6.1 and 6.2 for RM and KW, respectively [23]. An extension of Kesten’s rule to higher dimensions is discussed in [11]. See [18] for an extensive review of both deterministic and stochastic step sizes.

6.3.2 Averaging Iterates

Iterate averaging approaches SA from a different angle. Instead of fine-tuning the step sizes to adapt to the function characteristics, iterate averaging takes bigger steps (i.e., a n larger than O(n −1)) so that the estimates oscillate around the optimum, and the average of the iterates then gives a good approximation to the true optimum. The idea is simple, yet can be very effective. It is easy to see that for this method to be successful, it is essential for the iterates to surround the optimum in a balanced manner and for the domain over which the iterates oscillate to shrink as n increases. Averaging trajectories reduces the sensitivity to the initial step-size choice. The algorithm follows recursion (6.3) for the RM case; however, instead of taking the last iterate x N as the output, the optimum is estimated by

$$\displaystyle\begin{array}{rcl} x_{N}^{{\ast}} = \frac{1} {N}\sum _{n=1}^{N}x_{ n},& & {}\\ \end{array}$$

which is an average of N iterates, where N is the stopping time. Under “classic” assumptions, iterate averaging achieves the same convergence rate as the RM method. Furthermore, \(\sqrt{n}(x_{ n}^{{\ast}}- x^{{\ast}})\) is asymptotically normal with mean zero and the smallest possible covariance matrix, the inverse of the average Fisher information matrix (cf. [31]). A constant step size can also be applied and yields convergence in distribution [28].

A variation of this method is called the “sliding window” average, which is based on the last m iterates:

$$\displaystyle\begin{array}{rcl} x_{N}^{{\ast}} = \frac{1} {m}\sum _{n=N-m+1}^{N}x_{ n}.& &{}\end{array}$$
(6.6)

An advantage of (6.6) is that it ignores the first N − m iterates, which may be poor estimates, since the first iterate is arbitrary, and averages only the last m, which are presumed to be closer to \(x^{{\ast}}\). Asymptotic normality for a growing window is shown in [26, 28], which also covers constant step sizes. Another modification of the original method incorporates \(x_{N}^{{\ast}}\) with x N in the components being averaged, which is known as the feedback approach [27]. These methods are suited for problems where the iterates hover around the optimum. In an empirical study, iterate averaging was applied to SPSA [29]. The results suggest that averaging is ideal when the Hessian of f(x) is large, since a large Hessian is associated with high variability in f(x), which indicates that the iterates are moving around the optimum. In general, averaging iterates leads to more robustness with respect to the step-size sequence because of the reduced sensitivity, while converging at the same optimal asymptotic rate as RM. Inspired by iterate averaging, a weighted-average version of KW was proposed that achieves the optimal asymptotic convergence rate O(n −1∕2) under certain conditions [13]. Under certain parameter settings, iterate averaging and weighted averaging produce the same estimator.
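
Both outputs are one-liners once the trajectory is stored; the sketch below computes the full Polyak–Ruppert average and the sliding-window average (6.6) from a saved list of iterates (a hypothetical helper, shown only to fix ideas).

```python
import numpy as np

def averaged_outputs(iterates, m):
    """Averaging outputs from a stored trajectory x_1, ..., x_N."""
    xs = np.asarray(iterates, dtype=float)
    full_avg = xs.mean(axis=0)         # x_N^* = (1/N) * sum of all iterates
    window_avg = xs[-m:].mean(axis=0)  # (6.6): average of the last m iterates only
    return full_avg, window_avg
```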

6.3.3 Varying Bounds

Initially, the asymptotic theory for SA considered only functions satisfying specific global conditions; however, it was subsequently shown that the requirements need only hold on a compact set \(\varTheta \subseteq \mathbb{R}^{d}\) containing the optimum. Therefore, the projection operator is particularly important in the constrained optimization setting. Since the optimum is unknown, the compact set should be large enough that \(x^{{\ast}} \in \varTheta\) with high probability; however, this may increase the potential for the algorithm to perform poorly due to the size of the parameter search space [1]. For instance, if the compact set is very large, the step size is extremely small, and the current iterate is extremely far from the optimum, then convergence is likely to be slow; however, if the compact set is small and contains the optimum, then the iterates will never be too far from the optimum. Even if the step sizes are small, convergence will be much faster than for the same algorithm restricted to a much larger set.

One of the first ideas was to project the iterates onto a predetermined fixed point once the magnitude of the iterate surpassed an arbitrarily specified threshold, with the threshold increasing after it is exceeded [10]. This method converges asymptotically, but in practice, it has its pitfalls. When an iterate is projected onto an arbitrary fixed point, in a sense, the algorithm restarts from this “initial” value with a smaller sequence of step sizes. Not only does it lose all of the progress gained from the iterations prior to the projection, but the reduction in step size could hinder the convergence by moving even slower towards the optimum. To circumvent this issue, it was shown that it suffices to project the iterates onto a predetermined bounded set [46]. This is a slight improvement, since the iterates do not start from the same position with an even smaller step size. However, it still has its limitations, since the initial start values are restricted to the predetermined compact set. Later, an algorithm defined over a growing feasible region was introduced by writing Θ as an increasing sequence of compact sets (i.e., \(\varTheta _{m} \subseteq \varTheta _{m+1}\), where \(\varTheta = \cup _{m}\varTheta _{m}\)) [1]. The orthogonal projection operator changes from \(\varPi _{\varTheta _{ m}}\) to \(\varPi _{\varTheta _{ m+1}}\) if \(x_{n}\notin \varTheta _{m}\). The idea is to start with a smaller feasible region Θ 1 and only increase it when there is reason to believe the optimum \(x^{{\ast}}\notin \varTheta _{1}\) (i.e., when \(x_{n}\notin \varTheta _{1}\)). Since the projection is made onto the current compact set Θ m , the progress gained up to that point is not lost. The feasible region Θ is written as \(\cup _{m=1}^{\infty }\varTheta _{m}\), so \(x^{{\ast}}\) must be contained in Θ m for all sufficiently large m. If \(x^{{\ast}}\) is contained in one of the earlier compact sets and the sets grow slowly, the empirical results can improve significantly. The key to performance is choosing the sequence {Θ m } appropriately: if it grows too quickly, the results might be very similar to those of the original SA algorithm. The following algorithm and convergence result are for the RM multidimensional case d ≥ 1, where | | ⋅ | | denotes the Euclidean norm.

SA with Varying Bounds

  • Input. Choose \(x_{1} \in \varTheta _{1}\), {a n }, {Θ m }, and stopping time N.

  • Initialize. Let n = 1 and m = 1.

  • While n < N,

    • Step 1. Generate an estimate \(\hat{\nabla }f(x_{n})\) of ∇f(x n ).

    • Step 2. Compute \(x_{n+1}' = x_{n} - a_{n}\hat{\nabla }f(x_{n})\). If \(x_{n+1}' \in \varTheta _{m},\) go to Step 3. Otherwise, go to Step 4.

    • Step 3. Let \(x_{n+1} = x_{n+1}'\), n = n + 1 and go to Step 1.

    • Step 4. Let \(x_{n+1} =\varPi _{\varTheta _{m}}(x_{n+1}')\), n = n + 1,  m = m + 1 and go to Step 1.

  • Output. \(x_{N}^{{\ast}} = x_{N}.\)
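
The sketch below is one possible rendering of the algorithm in Python, with in_set and project_onto as hypothetical user-supplied routines describing the growing sets {Θ m }; the box-shaped Θ m in the comments is an illustrative assumption.

```python
import numpy as np

def sa_varying_bounds(grad_est, in_set, project_onto, step, x1, N):
    """SA with varying bounds: keep the trial update if it stays in Theta_m;
    otherwise project onto the current Theta_m and enlarge the set."""
    x, m = np.asarray(x1, dtype=float), 1
    for n in range(1, N):
        x_trial = x - step(n) * grad_est(x)
        if in_set(m, x_trial):
            x = x_trial                     # Step 3
        else:
            x = project_onto(m, x_trial)    # Step 4: prior progress is kept
            m += 1
    return x                                # x_N^* = x_N

# Hypothetical growing boxes Theta_m = [-2^m, 2^m]^d:
# in_set       = lambda m, x: bool(np.all(np.abs(x) <= 2.0 ** m))
# project_onto = lambda m, x: np.clip(x, -2.0 ** m, 2.0 ** m)
```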

Theorem 6.3 (Theorem 2 [1]).

Let the sequence {x n } be generated using the above algorithm, \(\epsilon _{n} =\hat{ \nabla }f(x_{n}) -\mathsf{E}[\hat{\nabla }f(x_{n})\vert \mathcal{F}_{n}]\) , and \(\beta _{n} = \mathsf{E}[\hat{\nabla }f(x_{n})\vert \mathcal{F}_{n}] -\nabla f(x_{n})\) , where \(\mathcal{F}_{n}\) is the smallest σ-algebra used to generate x n+1 . If the following conditions hold:

  1. The sequence {Θ m } is a set of compact convex sets such that \(\varTheta _{m} \subseteq \varTheta _{m+1}\) for all m and \(\cup _{m=1}^{\infty }\varTheta _{m} =\varTheta\) .

  2. The positive sequences of real numbers {a n } and {c n } converge to zero such that \(\sum _{n=1}^{\infty }a_{n} = \infty,\ \sum _{n=1}^{\infty }a_{n}c_{n} < \infty,\) and \(\sum _{n=1}^{\infty }a_{n}^{2}c_{n}^{-2} < \infty \) .

  3. There exists κ ≥ 0 such that \(\mathsf{E}[\vert \vert \epsilon _{n}\vert \vert ^{2}\vert \mathcal{F}_{n}] \leq \frac{\kappa } {c_{ n}^{2}} (1 + \vert \vert x_{n} - x^{{\ast}}\vert \vert ^{2})\) a.s. for all n.

  4. ||β n || is bounded a.s. for all n, and \(\sum _{n=1}^{\infty }a_{n}\vert \vert \beta _{n}\vert \vert < \infty \)  a.s.

  5. There exist a positive sequence of real numbers {M n } and an integer N ≥ 1 such that \(\sum _{n=1}^{\infty }a_{n}^{2}M_{n}^{2} < \infty \) and for all n ≥ N, \(\sup _{x\in \varTheta _{n-1}}\vert \vert \nabla f(x)\vert \vert \leq M_{n}\) .

  6. There exists a unique \(x^{{\ast}}\) ∈Θ such that ∇f(x ∗ ) = 0, and for all 0 < δ ≤ 1, \(\inf _{x\in \varTheta:\delta \leq \vert \vert x-x^{{\ast}}\vert \vert \leq \delta ^{-1}}\nabla f(x)^{\mathsf{T}}(x - x^{{\ast}}) > 0\) .

Then \(x_{n} \rightarrow x^{{\ast}}\) a.s. as n →∞.

If an appropriate increasing sequence of compact sets is chosen, the finite-time performance can improve significantly, but this optimal choice is still an open problem.

6.3.4 Simultaneous Perturbation Stochastic Approximation (SPSA)

Simultaneous perturbation stochastic approximation (SPSA) specifically addresses multivariate optimization problems [38]. Similar to KW-type algorithms, SPSA only requires the objective function values to approximate the underlying gradient and is therefore easy to implement. However, SPSA only requires two functional evaluations at each iteration regardless of the dimension of the parameter space Θ, which could potentially reduce the computational cost significantly in high-dimensional problems. SPSA perturbs the vector x randomly in all directions simultaneously (hence, the name of the method) and the ith component of the gradient estimate has the form

$$\displaystyle\begin{array}{rcl} \hat{\nabla }f_{i}(x_{n}) = \frac{Y (x_{n} + c_{n}\varDelta _{n},\xi _{n}^{+}) - Y (x_{n} - c_{n}\varDelta _{n},\xi _{n}^{-})} {2c_{n}\varDelta _{n,i}},& &{}\end{array}$$
(6.7)

where the components of \(\varDelta _{n} = (\varDelta _{n,1},\mathop{\ldots },\varDelta _{n,d}) \in \mathbb{R}^{d}\) are generally assumed to be independent across components and i.i.d. across iterations, \(c_{n} \in \mathbb{R}^{+}\) is the finite difference step size, and \(\xi _{n}^{\pm }\) denotes the randomness. Observe that the numerator in (6.7) involves two function estimates and is identical for all i; therefore, the cost of the full gradient (aside from generating Δ n ) is independent of dimension.

SPSA Algorithm

  • Input. Choose x 1 ∈ Θ, {a n }, {c n }, and stopping time N.

  • Initialize. Let n = 1.

  • While n < N, 

    • Step 1. Generate a d-dimensional random perturbation vector Δ n .

    • Step 2. Generate an estimate of ∇f(x n ):

      $$\displaystyle\begin{array}{rcl} \hat{\nabla }f(x_{n}) = \dfrac{Y (x_{n} + c_{n}\varDelta _{n},\xi _{n}^{+}) - Y (x_{n} - c_{n}\varDelta _{n},\xi _{n}^{-})} {2c_{n}} \left [\begin{array}{*{10}c} \varDelta _{n,1}^{-1}\\ \vdots \\ \varDelta _{n,d}^{-1}\end{array} \right ]& & {}\\ \end{array}$$
    • Step 3. Compute \(x_{n+1} = x_{n} - a_{n}\hat{\nabla }f(x_{n})\).

    • Step 4. Let n = n + 1. Go to Step 1.

  • Output. \(x_{N}^{{\ast}} = x_{N}\).
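
A compact Python sketch of the listing above follows, using the common symmetric Bernoulli (± 1) perturbations; the step-size and perturbation-size schedules and the quadratic test function are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa(Y, x1, N, a=lambda n: 0.1 / n, c=lambda n: 0.1 / n ** 0.25):
    """SPSA: two samples of Y per iteration, regardless of the dimension d."""
    x = np.asarray(x1, dtype=float)
    for n in range(1, N):
        delta = rng.choice([-1.0, 1.0], size=x.shape)   # symmetric Bernoulli
        diff = Y(x + c(n) * delta) - Y(x - c(n) * delta)
        ghat = diff / (2.0 * c(n) * delta)              # componentwise /Delta_{n,i}
        x -= a(n) * ghat
    return x                                            # x_N^* = x_N

# Example: noisy quadratic in d = 10 dimensions.
x_hat = spsa(lambda x: float(np.sum(x ** 2) + rng.normal(scale=0.1)),
             x1=np.ones(10), N=5_000)
```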

Theorem 6.4 (Theorem 7.1 [40]).

Suppose f has a unique minimum x ∈Θ and {x n } is generated using SPSA. If the following conditions hold:

  1. The positive sequences of real numbers {a n } and {c n } converge to zero such that \(\sum _{n=1}^{\infty }a_{n} = \infty \) and \(\sum _{n=1}^{\infty }a_{n}^{2}c_{n}^{-2} < \infty \) .

  2. The function f ∈ C 3 and is bounded on \(\mathbb{R}^{d}\) .

  3. ||x n || < ∞ for all n.

  4. \(\mathsf{E}[\epsilon _{n}^{+} -\epsilon _{n}^{-}\vert \varDelta _{n},\mathcal{F}_{n}] = 0\) and \(\mathsf{E}[(Y (x_{n} \pm c_{n}\varDelta _{n},\xi _{n}^{\pm })/\varDelta _{n,i})^{2}]\) is uniformly bounded for all n,i.

  5. \(x^{{\ast}}\) is an asymptotically stable solution of the differential equation ∂x(t)∕∂t = −∇f(x(t)).

  6. For each n, the \(\{\varDelta _{n,i}\}_{i=1}^{d}\) are independent and identically distributed, symmetrically distributed about zero, and uniformly bounded in magnitude for all n,i.

Then \(x_{n} \rightarrow x^{{\ast}}\) a.s. as n →∞.

The optimal convergence rate for SPSA is O(n −1∕3) [38]. Various convergence proofs have been presented with slight modifications to the conditions (cf. [9, 13, 19, 38, 43]). The perturbation sequence {Δ n }, where \(\varDelta _{n} = (\varDelta _{n,1},\mathop{\ldots },\varDelta _{n,d})\) with {Δ n, i } independent, must have mean zero (i.e., E[Δ n ] = 0) and finite inverse moments (i.e., \(\mathsf{E}[\vert \varDelta _{n,i}\vert ^{-1}] < \infty \) for \(i = 1,\mathop{\ldots },d\)). As a result, the Gaussian distribution is not applicable. Instead, the most common choice is the symmetric Bernoulli distribution, taking the values ± 1 each with probability 0.5. In addition, an appropriately scaled x n is approximately normal for large n, and the relative efficiency of SPSA depends on the geometric shape of f(x), the choice of {a n } and {c n }, the distribution of {Δ n, i }, and the noise level.

Many extensions to the original SPSA algorithm have been developed, e.g., for the constrained setting using projection operators [17, 36]. A slight modification is averaging of the SPSA gradient estimators: instead of generating one gradient estimate at each iteration, multiple gradient estimates can be generated at additional computational cost and averaged to reduce the noise. An accelerated form of SPSA approximates the second-order Hessian ∇2 f(x) to speed up convergence [40], analogous to the Newton–Raphson method. Iterate averaging in the SPSA setting has also been explored, but performs relatively poorly in finite time [13, 39]. All in all, SPSA has been shown to be an effective SA method for tackling high-dimensional problems, with ease of implementation and asymptotic theory to support it.

6.4 Recent Modifications

This section presents several recently proposed modifications that focus on improving the finite-time performance of SA: the scaled-and-shifted Kiefer–Wolfowitz (SSKW) algorithm, the robust SA (RSA) algorithm, the accelerated SA (AC-SA) algorithm, and the Secant-Tangents AveRaged stochastic approximation (STAR-SA) algorithm. The theoretical results for RSA and AC-SA analyze the performance of the estimates through \(f(x_{N}^{{\ast}}) - f(x^{{\ast}})\), as in the inequality (6.15). This is an alternative way to view the performance of SA, focusing on the distance between the function value at the estimate and the optimal function value (i.e., \(\mathsf{E}[f(x_{N}^{{\ast}}) - f(x^{{\ast}})]\)) as opposed to a distance between the estimate and the optimum (e.g., \(\mathsf{E}(x_{N}^{{\ast}}- x^{{\ast}})^{2}\)). To illustrate the difference, consider an extremely flat function on the entire feasible region. The alternative performance measure will indicate that almost any iterate in the feasible region is a good estimate, whereas performance based on the mean-squared error (MSE) between the estimate and the optimum will be more sensitive to the estimate \(x_{N}^{{\ast}}\). Further details on these two algorithms are provided in the next chapter.

6.4.1 Scaled-and-Shifted Kiefer–Wolfowitz (SSKW)

The scaled-and-shifted Kiefer–Wolfowitz (SSKW) algorithm [4] adaptively adjusts {a n } and {c n } finitely many times during the course of the algorithm to adapt to the characteristics of the function and the noise level, in hopes of preventing slow convergence in finite time. The idea is to increase {a n } so the iterates are able to make noticeable progress towards the optimum, with the option of decreasing {a n } later if it is too large. Furthermore, if the direction of the gradient is classified as incorrect, then {c n } is increased to reduce the noise. Note that KW only requires two parameter choices, {a n } and {c n }, whereas SSKW requires eleven, as seen in the algorithm below.

SSKW Algorithm

Scaling Phase

  • Input. {a n }, {c n }, [l, u], Π Θ , stopping time N, and

    • h 0 = number of forced boundary hits,

    • γ 0 = scale up factor for {c n },

    • k a = maximum number of shifts of {a n },

    • v a = initial upper bound of shift,

    • ϕ a = maximum scale up factor for {a n },

    • k c = maximum number of scale ups for {c n },

    • c 0 = maximum value of {c n } after scale ups (i.e., \(c_{n} \leq c^{max} = c_{0}(u - l)\)),

    • g 0 = maximum number of gradient estimates in scaling phase,

    • m max = maximum number of adaptive iterations (m max  ≤ N).

  • Initialize.

    • Choose \(x_{1} \in [l + c_{1},u - c_{1}]\).

    • Let n = 1, m = 1, g = 1, sh = 0, and sc = 0.

  • Do while m ≤ h 0 and g ≤ g 0.

    • Step 1.

      • Generate an estimate \(\hat{\nabla }f(x_{n})\) using symmetric differences.

      • Compute x n+1 using recursion (6.3).

        • If \(x_{n+1} \in (l + c_{n},x_{n})\), go to Step 2.

        • If \(x_{n+1} \in (x_{n},u - c_{n})\), go to Step 3.

        • If \(x_{n+1} > u - c_{n+1}\) and \(x_{n} = u - c_{n}\), or if \(x_{n+1} < l + c_{n+1}\) and \(x_{n} = l + c_{n}\), go to Step 4 if \(sc \leq k_{c}\).

        • If \(x_{n+1} > u - c_{n+1}\) and \(x_{n} = l + c_{n}\), or if \(x_{n+1} < l + c_{n+1}\) and \(x_{n} = u - c_{n}\), go to Step 5.

    • Step 2.

      • Scale {a n } up by \(\alpha =\min \{ \phi _{a},(l + c_{n+1} - x_{n})/(x_{n+1} - x_{n})\}\) and use {α a n } for the remaining iterations.

      • Set \(x_{n+1} = l + c_{n+1}\). Let n = n + 1, m = m + 1, g = g + 1 and go to Step 1.

    • Step 3.

      • Scale {a n } up by \(\alpha =\min \{ \phi _{a},(u - c_{n+1} - x_{n})/(x_{n+1} - x_{n})\}\) and use {α a n } for the remaining iterations.

      • Set \(x_{n+1} = u - c_{n+1}\). Let n = n + 1, m = m + 1, g = g + 1 and go to Step 1.

    • Step 4.

      • Scale {c n } up by \(\gamma =\min \{\gamma _{0},c^{max}/c_{n}\}\) and use {γ c n } for the remaining iterations.

      • Let sc = sc + 1 and go to Step 5.

    • Step 5.

      • Set \(x_{n+1} =\min \{ u - c_{n+1},\max \{x_{n+1},l + c_{n+1}\}\}\).

      • Let n = n + 1, g = g + 1 and go to Step 1.

Shifting Phase

  • While n ≤ m max and n ≤ N,

    • Step 1.

      • Generate an estimate \(\hat{\nabla }f(x_{n})\) using symmetric differences.

      • Compute x n+1 using (6.3).

        • If \(x_{n+1} > u - c_{n+1}\) and \(x_{n} = l + c_{n}\), or if \(x_{n+1} < l + c_{n+1}\) and \(x_{n} = u - c_{n}\), go to Step 2 if sh < k a .

        • If \(x_{n+1} > u - c_{n+1}\) and \(x_{n} = u - c_{n}\), or if \(x_{n+1} < l + c_{n+1}\) and \(x_{n} = l + c_{n}\), go to Step 3 if sc < k c .

        • Otherwise, go to Step 4.

    • Step 2.

      • Find the smallest integer β′ such that \(x_{n+1} \in (l + c_{n},u - c_{n})\) when the step size \(a_{n+\beta '}\) is used.

      • Set \(\beta =\min (v_{a},\beta ')\) and shift {a n } to \(\{a_{n+\beta }\}\). If \(\beta = v_{a}\), set \(v_{a} = 2v_{a}\).

      • Let sh = sh + 1 and go to Step 4.

    • Step 3.

      • Scale {c n } up by \(\gamma =\min \{\gamma _{0},c^{max}/c_{n}\}\) and use {γ c n } for the remaining iterations.

      • Let sc = sc + 1 and go to Step 4.

    • Step 4.

      • Set \(x_{n+1} =\min \{ u - c_{n+1},\max \{x_{n+1},l + c_{n+1}\}\}\).

      • Let n = n + 1 and go to Step 1.

KW Algorithm

  • If n > m max and n < N, then SSKW reverts back to KW and stops when n = N.

  • Output. \(x_{N}^{{\ast}} = x_{N}\).

The SSKW algorithm has two pre-processing phases, scaling and shifting, which adjust the tuning sequences in order to improve the finite-time performance before reverting to the original KW algorithm. In the scaling phase, {a n } is scaled up by a factor α, i.e., {a n } to {α a n }, so the iterates can move from one boundary to the other, ensuring the step sizes are not too small relative to the gradient. In the shifting phase, the sequence {a n } is decreased by shifting or “skipping” a finite number (β) of terms from {a n } to {a n+β } when the iterates fall outside of the feasible region while the sign of the gradient is correct. This acts as a recourse stage and reduces the step size faster in case the step-size sequence {a n } is too large. During both phases, {c n } is scaled up by γ, i.e., {c n } to {γ c n }, if the previous iterate is at the boundary and the update falls outside the feasible region while moving in the wrong direction. This increase is an attempt to reduce the noise of the gradient estimate. These adjustments do not affect the asymptotic convergence, since the scaling phase only scales {a n } up by a constant, the shifting phase only skips a finite number of terms in {a n }, and the perturbation sequence {c n } is only scaled up by a constant, all of which occur finitely many times.

6.4.2 Robust Stochastic Approximation (RSA)

The robust SA (RSA) method is intended to be relatively insensitive to the choice of the step-size sequence, similar to Polyak–Ruppert iterate averaging. The form of RSA is identical to (6.3) with the exception of the output. Instead of taking the last iterate (i.e., \(x_{N}^{{\ast}} = x_{N}\)), the output is calculated as

$$\displaystyle\begin{array}{rcl} x_{N}^{{\ast}} = \frac{\sum _{n=1}^{N}a_{ n}x_{n}} {\sum _{n=1}^{N}a_{n}},& & {}\\ \end{array}$$

where a n  > 0 for all n. Clearly, if a n  = a for all n, where \(a \in \mathbb{R}^{+}\), then \(x_{N}^{{\ast}} = \frac{1} {N}\sum _{n=1}^{N}x_{ n}\), giving the uniformly weighted average of Polyak–Ruppert. As mentioned earlier, iterate averaging under a constant step size for a moving window is asymptotically normal [28]. A finite-time bound on \(\mathsf{E}[f(x_{N}^{{\ast}}) - f(x^{{\ast}})]\) was derived for RSA when f is assumed convex [30]. Assume there exists C > 0 such that \(\mathsf{E}[\vert \vert \hat{\nabla } f(x)\vert \vert ^{2}] \leq C^{2}\) for all x ∈ Θ. Then for an N-step iteration policy,

$$\displaystyle\begin{array}{rcl} \mathsf{E}[f(x_{N}^{{\ast}}) - f(x^{{\ast}})] \leq \frac{\vert \vert x_{0} - x^{{\ast}}\vert \vert ^{2} + C^{2}\sum _{ n=1}^{N}a_{ n}^{2}} {2\sum _{n=1}^{N}a_{n}}.& &{}\end{array}$$
(6.8)

For equal weights, i.e., iterate averaging, the bound on the right-hand side of (6.8) is minimized by choosing

$$\displaystyle\begin{array}{rcl} a_{n} = a:= \frac{D_{\varTheta }} {C\sqrt{N}},& & {}\\ \end{array}$$

where \(D_{\varTheta } =\max _{x,y\in \varTheta }\vert \vert x - y\vert \vert \). Using the distance \(\vert \vert x_{0} - x^{{\ast}}\vert \vert \) in place of D Θ tightens the bound in (6.8), but \(x^{{\ast}}\) is unknown, so the improvement may not be practically meaningful. This step size requires the number of iterations N to be fixed in advance. Similar to iterate averaging, a sliding window average can also be employed in RSA. The estimate consists of the last N − K + 1 iterates and has the form

$$\displaystyle\begin{array}{rcl} x_{N,K}^{{\ast}} = \frac{\sum _{n=K}^{N}a_{ n}x_{n}} {\sum _{n=K}^{N}a_{n}}.& &{}\end{array}$$
(6.9)

If we consider the varying step size

$$\displaystyle\begin{array}{rcl} a_{n} = \dfrac{\theta D_{\varTheta }} {C\sqrt{n}},& &{}\end{array}$$
(6.10)

for θ > 0, then we have the bound

$$\displaystyle\begin{array}{rcl} \mathsf{E}[f(x_{N,K}^{{\ast}}) - f(x^{{\ast}})] \leq \dfrac{D_{\varTheta }C} {\sqrt{N}}\left [\frac{2} {\theta } \left ( \dfrac{N} {N - K + 1}\right ) + \frac{\theta } {2}\sqrt{\frac{N} {K}}\right ],& &{}\end{array}$$
(6.11)

for 1 ≤ K ≤ N.
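
The following sketch implements RSA with the varying step size (6.10) and the weighted-average output (with K = 1, i.e., all iterates included); D_theta, C, and theta are the constants appearing in (6.10) and must be supplied or estimated by the user, and grad_est and project are hypothetical user-supplied routines.

```python
import numpy as np

def rsa(grad_est, project, x1, N, D_theta, C, theta=1.0):
    """Robust SA: RM-type iterates, output = {a_n}-weighted average of x_1..x_N."""
    x = np.asarray(x1, dtype=float)
    num = np.zeros_like(x)                        # running sum of a_n * x_n
    den = 0.0                                     # running sum of a_n
    for n in range(1, N + 1):
        a_n = theta * D_theta / (C * np.sqrt(n))  # step size (6.10)
        num += a_n * x
        den += a_n
        x = project(x - a_n * grad_est(x))
    return num / den                              # weighted-average output
```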

6.4.3 Accelerated Stochastic Approximation (AC-SA) for Strongly Convex Functions

The accelerated SA (AC-SA) algorithm [21] takes a similar approach to iterate averaging and RSA by taking long strides and incorporating all of the iterates into the output. The next two algorithms, accelerated SA for strongly convex and for convex functions, take advantage of the smoothness of the function when it is available. AC-SA for convex functions is a special case of AC-SA for strongly convex functions, so we first introduce AC-SA for strongly convex functions and then specialize the strong convexity parameter for the convex case.

AC-SA is an example of a proximal method, which introduces a proximity function into the objective function. The prox-function acts as a regularization term to prevent the next iterate x n+1 from being too far from x n , and it is built from a distance-generating function or Bregman function \(\omega:\varTheta \rightarrow \mathbb{R}\), which is continuously differentiable and strongly convex with modulus ν > 0, satisfying

$$\displaystyle{\langle x - y,\nabla \omega (x) -\nabla \omega (y)\rangle \geq \nu \vert \vert x - y\vert \vert ^{2}\ \forall x,y \in \varTheta,}$$

where \(\langle \cdot,\cdot \rangle\) denotes the inner product. A prox-function with the given distance generating function is

$$\displaystyle{V (x,y) = V _{\omega }(x,y) =\omega (y) - [\omega (x) +\langle \nabla \omega (x),y - x\rangle ].}$$

As \(x_{n} \rightarrow x^{{\ast}}\), the regularization term vanishes, so minimizing f(x) plus the regularizer becomes equivalent to minimizing the function f(x) itself.

Consider a strongly convex function f(⋅ ) satisfying

$$\displaystyle\begin{array}{rcl} \frac{\mu } {2}\vert \vert y - x\vert \vert ^{2} \leq f(y) - f(x) -\langle \nabla f(x),y - x\rangle \leq \frac{L} {2} \vert \vert y - x\vert \vert ^{2} + M\vert \vert y - x\vert \vert,& &{}\end{array}$$
(6.12)

for all x, y ∈ Θ where μ > 0 is the strong convexity parameter. Notice that if f is Lipschitz continuous with Lipschitz constant M∕2, then (6.12) holds with M > 0, L = 0, and μ = 0, and if f has Lipschitz continuous gradients with Lipschitz constant L, then (6.12) holds with M = 0, L > 0, and μ = 0.

The AC-SA algorithm updates three sequences, \(\{x_{n}^{md}\},\{x_{n}^{ag}\}\), and {x n }. Here, “md” and “ag” are abbreviations for median and aggregate, respectively, and median is used in a loose sense.

Accelerated SA Method for Strongly Convex Functions

  • Input.

    • Specify V (x, y), {α n }, and {γ n } such that α 1 = 1, α n  ∈ (0, 1) for n ≥ 2, and γ n  > 0 for n ≥ 1, and choose a stopping time N.

  • Initialize. Choose \(x_{0}^{ag} = x_{0} \in \varTheta\) and let n = 1.

  • While n < N,

    • Step 1. Compute

      $$\displaystyle\begin{array}{rcl} x_{n}^{md}& =& \frac{\alpha _{n}[(1 -\alpha _{n})\mu +\gamma _{n}]} {\gamma _{n} + (1 -\alpha _{n}^{2})\mu } x_{n-1} + \frac{(1 -\alpha _{n})(\mu +\gamma _{n})} {\gamma _{n} + (1 -\alpha _{n}^{2})\mu }x_{n-1}^{ag} {}\\ \end{array}$$
    • Step 2. Generate an estimate \(\hat{\nabla }f(x_{n}^{md})\) of \(\nabla f(x_{n}^{md})\) and compute

      $$\displaystyle\begin{array}{rcl} x_{n}& =& \arg \min _{x\in \varTheta }\{\alpha _{n}[\langle \hat{\nabla }f(x_{n}^{md}),x\rangle +\mu V (x_{ n}^{md},x)] + [(1 -\alpha _{ n})\mu +\gamma _{n}]V (x_{n-1},x)\} {}\\ x_{n}^{ag}& =& \alpha _{ n}x_{n} + (1 -\alpha _{n})x_{n-1}^{ag} {}\\ \end{array}$$
    • Step 3. Let n = n + 1 and go to Step 1.

  • Output. \(x_{N}^{{\ast}} = x_{N}^{ag}\).

Note: \(V (x,y) = \frac{1} {2}\vert \vert x - y\vert \vert ^{2}\) using the Euclidean norm with ν = 1 is a common prox-function. Refer to [20] for details.

Theorem 6.5 (Theorem 1 [20]).

Assume \(V (x,y) \leq \frac{1} {2}\vert \vert x - y\vert \vert ^{2}\ \text{for all }x,y \in \varTheta\) when μ > 0 and \(\mathsf{E}[\vert \vert \hat{\nabla }f(x) -\nabla f(x)\vert \vert ^{2}] \leq \sigma ^{2}\ \forall x \in \varTheta\) . Choose {α n } and {γ n } such that

$$\displaystyle\begin{array}{rcl} \nu (\mu +\gamma _{n}) > L\alpha _{n}^{2},& &{}\end{array}$$
(6.13)
$$\displaystyle\begin{array}{rcl} \gamma _{n}/\varGamma _{n} =\gamma _{n+1}/\varGamma _{n+1}\ \ \text{for }\ n \geq 1,& &{}\end{array}$$
(6.14)

where

$$\displaystyle{\varGamma _{n} = \left \{\begin{array}{ll} 1 &\mbox{ if $n = 1$};\\ (1 -\alpha _{ n})\varGamma _{n-1} & \mbox{ if $n \geq 2$}. \end{array} \right.}$$

Then,

$$\displaystyle\begin{array}{rcl} \mathsf{E}[f(x_{N}^{ag}) - f(x^{{\ast}})]& \leq & \varGamma _{ N}\left (\gamma _{1}V (x_{0},x^{{\ast}}) +\sum _{ n=1}^{N} \frac{2(M^{2} +\sigma ^{2})\alpha _{ n}^{2}} {\varGamma _{n}[\nu (\mu +\gamma _{n}) - L\alpha _{n}^{2}]}\right ).{}\end{array}$$
(6.15)

Consider α n  = 2∕(n + 1) and γ n  = 4L∕[ν n(n + 1)], which give Γ n  = 2∕[n(n + 1)]. It can easily be checked that these choices satisfy conditions (6.13) and (6.14). Under these choices, the right-hand side of (6.15) can be bounded by

$$\displaystyle\begin{array}{rcl} \dfrac{4LV (x_{0},x^{{\ast}})} {\nu N(N + 1)} + \dfrac{8(M^{2} +\sigma ^{2})} {\nu \mu (N + 1)},& &{}\end{array}$$
(6.16)

for μ > 0. The bounds in (6.15) and (6.16) rely on additional information about the function and gradient that is typically unknown, so it must be approximated.

6.4.4 Accelerated Stochastic Approximation (AC-SA) for Convex Functions

AC-SA for convex functions is a special case of AC-SA for strongly convex functions with μ = 0. The algorithm is identical to AC-SA for strongly convex functions with the exception of the \(x_{n}^{md}\) and x n updates, which simplify since μ = 0. The resulting updates are

$$\displaystyle\begin{array}{rcl} x_{n}^{md}& =& \alpha _{ n}x_{n-1} + (1 -\alpha _{n})x_{n-1}^{ag}, {}\\ x_{n}& =& \arg \min _{x\in \varTheta }\{\alpha _{n}\langle \hat{\nabla }f(x_{n}^{md}),x\rangle +\gamma _{ n}V (x_{n-1},x)\}. {}\\ \end{array}$$

Interestingly, if \(V (x,y) = \frac{1} {2}\vert \vert x - y\vert \vert ^{2}\), then the update for x n simplifies to

$$\displaystyle\begin{array}{rcl} x_{n}& =& \varPi _{\varTheta }\left (x_{n-1} -\frac{\alpha _{n}} {\gamma _{n}}\hat{\nabla }f(x_{n}^{md})\right ),{}\end{array}$$
(6.17)

which has a similar form to the standard SA algorithm. Notice that in the update for x n in (6.17), α n ∕γ n takes the place of the step size a n in (6.3), and the gradient estimate \(\hat{\nabla }f\) is evaluated at \(x_{n}^{md}\) as opposed to x n−1. If we consider the same parameter setting as in the strongly convex case, the “step size” \(\alpha _{n}/\gamma _{n}\) increases with n. Furthermore, lower and upper bounds for the optimal objective function value can be computed online, and their difference converges to 0 as the number of iterations goes to infinity [20].
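
Under the Euclidean prox-function and the parameter choices of Theorem 6.6, the convex AC-SA reduces to a few lines. The sketch below is an assumed concrete instantiation for illustration (grad_est and project are hypothetical user-supplied placeholders), not a reference implementation from [20].

```python
import numpy as np

def ac_sa_convex(grad_est, project, x0, N, gamma, nu=1.0):
    """AC-SA for convex f with V(x,y) = ||x-y||^2/2, so x_n follows (6.17).
    Uses alpha_n = 2/(n+1) and gamma_n = 4*gamma/(nu*n*(n+1)); gamma >= 2L assumed."""
    x = x_ag = np.asarray(x0, dtype=float)
    for n in range(1, N + 1):
        alpha_n = 2.0 / (n + 1)
        gamma_n = 4.0 * gamma / (nu * n * (n + 1))
        x_md = alpha_n * x + (1.0 - alpha_n) * x_ag      # "md" point
        g = grad_est(x_md)                               # gradient estimate at x_md
        x = project(x - (alpha_n / gamma_n) * g)         # update (6.17)
        x_ag = alpha_n * x + (1.0 - alpha_n) * x_ag      # "ag" (output) sequence
    return x_ag                                          # x_N^* = x_N^ag
```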

Theorem 6.6 (Proposition 7 [20]).

Assume that the assumptions in Theorem  6.5 hold with μ = 0, and let α n = 2∕(n + 1) and γ n = 4γ∕[νn(n + 1)] for γ ≥ 2L. Then

$$\displaystyle\begin{array}{rcl} \mathsf{E}[f(x_{N}^{ag}) - f(x^{{\ast}})]& \leq & \dfrac{4\gamma V (x_{0},x^{{\ast}})} {\nu N(N + 1)} + \dfrac{4(M^{2} +\sigma ^{2})(N + 2)} {3\gamma },{}\end{array}$$
(6.18)

where

$$\displaystyle\begin{array}{rcl} \gamma =\max \left \{2L,\left [\frac{\nu (M^{2} +\sigma ^{2})N(N + 1)(N + 2)} {3V (x_{0},x^{{\ast}})} \right ]^{1/2}\right \}& & {}\\ \end{array}$$

minimizes the bound in  (6.18) .

6.4.5 Secant-Tangents AveRaged Stochastic Approximation (STAR-SA)

The Secant-Tangents AveRaged (STAR) stochastic approximation algorithm estimates the gradient using a hybrid estimator, which is a convex combination of a symmetric finite difference and an average of two direct gradient estimators:

$$\displaystyle\begin{array}{rcl} \hat{\nabla }f(x_{n})& =& \alpha _{n}\frac{Y (x_{n} + c_{n},\epsilon _{n}^{+}) - Y (x_{n} - c_{n},\epsilon _{n}^{-})} {2c_{n}} \\ & & +(1 -\alpha _{n})\left (\frac{Y '(x_{n} + c_{n},\delta _{n}^{+}) + Y '(x_{n} - c_{n},\delta _{n}^{-})} {2} \right ),{}\end{array}$$
(6.19)

where \(\epsilon _{n}^{\pm }\) and \(\delta _{n}^{\pm }\) denote the randomness (i.e., \(f(x_{n} \pm c_{n}) = \mathsf{E}[Y (x_{n} \pm c_{n},\epsilon _{n}^{\pm })]\) and \(f'(x_{n} \pm c_{n}) = \mathsf{E}[Y '(x_{n} \pm c_{n},\delta _{n}^{\pm })]\)), α n  ∈ [0, 1] for all n, and c n  → 0 and α n  → 0 as n →∞. The STAR gradient estimate requires function and gradient estimates at the two points \(x_{n} \pm c_{n}\) for each \(\hat{\nabla }f(x_{n})\). In a setting where direct gradients are available but are very noisy relative to the function estimates, it is difficult to decide between implementing RM or KW, even though RM converges faster asymptotically. Since neither algorithm’s performance is always superior to the other, the STAR gradient incorporates both. The weights of the convex combination play a critical role in the performance of STAR-SA and can be chosen to minimize the variance of the gradient estimate, so that it is less than the variance of both the symmetric finite difference gradient estimate and the direct gradient estimate. If

$$\displaystyle\begin{array}{rcl} \alpha _{n}^{{\ast}} = \frac{\sigma _{g}^{2}c_{ n}^{2} +\rho \sigma _{ f}\sigma _{g}c_{n}^{2}} {\sigma _{f}^{2} +\sigma _{ g}^{2}c_{n}^{2} + 2\rho \sigma _{f}\sigma _{g}c_{n}},& & {}\\ \end{array}$$

where \(\mathrm{Var}[Y (x,\epsilon )] =\sigma _{ f}^{2}\), \(\mathrm{Var}[Y '(x,\xi )] =\sigma _{ g}^{2}\), and \(\mathrm{Corr}(Y (x,\epsilon ),Y '(x,\xi )) =\rho\), then STAR-SA is theoretically optimal in terms of MSE compared to RM and KW for simple quadratic functions, and the variance of the STAR gradient is less than that of the RM and KW gradient estimates under certain conditions.
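
The STAR gradient itself is straightforward to code. The sketch below assumes Y and Yprime return one noisy function sample and one noisy direct gradient sample, respectively (hypothetical interfaces), and takes the weight alpha as an input, which could be set to \(\alpha _{n}^{{\ast}}\) above when the variance parameters are known or estimated.

```python
def star_gradient(Y, Yprime, x, c, alpha):
    """STAR hybrid estimate (6.19): convex combination of a symmetric finite
    difference (secant) and the average of two direct gradients (tangents)."""
    secant = (Y(x + c) - Y(x - c)) / (2.0 * c)
    tangents = 0.5 * (Yprime(x + c) + Yprime(x - c))
    return alpha * secant + (1.0 - alpha) * tangents
```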

Theorem 6.7 (Theorem 3 [6]).

Let {x n } be a sequence generated using recursion  (6.3) and gradient estimate  (6.19) . Assume

  1. There exist positive sequences {a n }, {c n }, and {α n } such that α n ∈ [0,1] for all n, \(\sum _{n=1}^{\infty }a_{n}\alpha _{n} = \infty \) , \(\sum _{n=1}^{\infty }a_{n}c_{n} < \infty \) , \(\sum _{n=1}^{\infty }a_{n}^{2} < \infty \) , and \(\sum _{n=1}^{\infty }a_{n}^{2}c_{n}^{-2} < \infty \) .

  2. There exist B,C > 0 such that \(\mathsf{P}(\vert f''(x)\vert \leq B) = 1\) and \(\mathsf{P}(\vert f'(x)\vert \leq C) = 1\) for all x ∈Θ.

  3. There exist \(K_{0},K_{1} > 0\) such that \(K_{0}\vert x - x^{{\ast}}\vert \leq \vert f'(x)\vert \leq K_{1}\vert x - x^{{\ast}}\vert \) for all x ∈Θ.

  4. f′(x)(x − x ∗ ) > 0 for all \(x \in \mathbb{R}\setminus \{x^{{\ast}}\}\) .

  5. For c > 0, \(\sigma ^{2} =\sup _{x\in \mathbb{R}}\mathrm{Var}[Y (x + c,\xi ^{+}) - Y (x - c,\xi ^{-})\vert x] < \infty \) for all x ∈Θ.

  6. \(\epsilon _{n}^{+}\) , \(\epsilon _{n}^{-}\) , \(\delta _{n}^{+}\) , \(\delta _{n}^{-}\) are i.i.d. with mean zero for all n.

Then \(x_{n}\mathop{ \rightarrow }\limits^{ L^{2}}x^{{\ast}}\) as n →∞.

Numerical experiments show that STAR-SA is competitive with RM and KW, even when the number of iterations for RM is doubled to account for the increased computational cost of STAR-SA [6]. In the experimental results, the STAR-SA algorithm either performs significantly better than RM and KW, or its MSE is close to that of whichever algorithm has the lower MSE. STAR-SA has been extended to higher dimensions by using simultaneous perturbation gradient estimates instead of a symmetric finite difference gradient estimate, to take advantage of their potential efficiency and robustness [5].

6.5 Numerical Experiments

We present three sets of numerical experiments comparing the mean-squared error (MSE) of various SA algorithms on several contrasting functions. The first set of experiments, taken from [7], illustrates the sensitivity of KW and two variants to the choice of the two step-size sequence parameters. The second set compares three SA algorithms, the robust SA (RSA) method, the accelerated SA (AC-SA) method, and the original RM algorithm, under various initial settings and step-size parameters for RM (i.e., starting values, compact intervals, noise levels, and step sizes). The last set of numerical experiments explores the potential gains from using the STAR gradient estimate, which utilizes both direct and indirect gradient estimates, as opposed to using them separately, as in RM and KW, respectively. Since the numerical experiments consider maximization problems, the signs of a n and α n ∕γ n in recursions (6.3) and (6.17), respectively, must be adjusted accordingly.

Sensitivity Analysis of KW and Its Variants

We perform a sensitivity analysis of KW and KW using Kesten’s rule (denoted henceforth by KWK) with symmetric finite difference gradient estimates, and we compare the results with SSKW. Using the parameter settings \(a_{n} =\theta _{a}/n,\ c_{n} =\theta _{c}/n^{1/4}\), with \(\theta _{a} > 0,\ \theta _{c} > 0\) arbitrary but fixed, N = 10,000 iterations, and 1,000 sample paths, our analysis replicates the results of [4] for f(x) = −0.001x 2 on the interval [−50, 50], where SSKW performs significantly better than KW in terms of MSE and oscillatory period; however, this result is obtained using what seem to be nearly worst-case parameter settings for KW. In our experiments, we consider a wide range of parameters and initial settings for KW and KWK: 19 initial starting values uniformly spaced within the truncated interval \(x_{1} \in \{-50 + 5k\mid k = 1,2,\mathop{\ldots },19\}\), 45 different θ a values parametrized by \(\theta _{a} \in \{ 10^{s}k\mid k = 1,2,\mathop{\ldots },9,s = 0,1,\mathop{\ldots },4\}\), and 10 different θ c values parametrized by \(\theta _{c} \in \{ 10^{s}k\mid k = 1,2,\mathop{\ldots },5,s = 0,1\}\). In total, there are 8,550 combinations.

Fig. 6.2 MSE of the 10,000th iterate of KW and KWK for three parameter settings and SSKW for \(f(x) = -0.001x^{2}\), σ = 0.001, \(a_{n} =\theta _{a}/n\), \(c_{n} =\theta _{c}/n^{1/4}\)

The numerical results illustrate the sensitivity of the classical SA methods to these parameters. In fact, near-optimal performance can be obtained with fine-tuning. Out of the 8,550 combinations, KW outperforms SSKW in half of the cases, which indicates that with some tuning, KW yields good performance for a fairly wide range of tunable parameters. Figure 6.2 plots the MSE of KW, KWK, and SSKW for \(f(x) = -0.001x^{2}\), σ = 0.001, against the initial starting values x 1 for several parameter choices that are representative of the majority of the results. The parameter setting \(\theta _{a} =\theta _{c} = 1\), identical to the settings in [4], is among the worst for KW and KWK; it appears as a nearly vertical orange line for both algorithms because the red (KW) and yellow (KWK) lines overlap. For this parameter setting, SSKW beats KW and KWK significantly for all initial values with the exception of x 1 = 0. The first column in Table 6.1 compares the MSE of all three algorithms with x 1 = 0.01, and clearly, KW outperforms SSKW in almost all cases. Of course, a practitioner would have no way of knowing whether or not the starting iterate was close to the true optimum, so these results do not indicate that KW will always perform well. They do indicate, however, that KW exhibits substantial variation in performance. In the cases \(\theta _{a} = 90,\theta _{c} = 5\) and \(\theta _{a} = 10,\theta _{c} = 5\), KW and KWK, respectively, outperform SSKW in a neighborhood around the optimum. There are also well-tuned parameters, such as \(\theta _{a} = 500,\ \theta _{c} = 4\) for KW and \(\theta _{a} = 100,\ \theta _{c} = 1\) for KWK, that outperform SSKW for all initial start values. When KW and KWK perform better than SSKW, the difference is not as pronounced as when SSKW outperforms KW, but careful tuning can partially mitigate the sensitivity of KW to parameters such as the initial iterate.

Table 6.1 MSE of the 100th, 1,000th, and 10,000th iterates for KW and its variants with x 1 = 0.01
Fig. 6.3 MSE comparison of KW and its variants for \(f(x) = 100e^{-0.006x^{2}}\) with \(a_{n} = 1/n\), \(c_{n} = 1/n^{1/4}\), N = 10,000

In addition, we implement KW and its variants using the same parameters (\(a_{n} = 1/n,\ c_{n} = 1/n^{1/4},\ x_{1} = 30\)) as in [4] on \(f(x) = 100e^{-0.006x^{2}}\) to test the algorithms under the same setting on a different function. Figure 6.3 plots the MSE of the 10,000th iterate as a function of the initial start value. The horizontal line for all noise levels indicates that SSKW is insensitive to the initial start value. KW and KWK outperform SSKW within certain intervals around the optimum for σ ∈ { 0.001, 0.01, 0.1, 1.0}, and KWK's better-performance intervals overlap those of KW. However, where the intervals overlap, KW using the deterministic step size 1∕n performs better than KWK, as can be seen in Fig. 6.3. Unfortunately, outside of those intervals, KW and KWK have a tendency to perform poorly.

RM, RM with Iterate Averaging, Robust SA and Accelerated SA

We investigate the MSE performance of RSA and AC-SA using direct gradient estimates and compare the results against the classical RM algorithm and RM with iterate averaging. We consider the optimal parameter settings for RSA and AC-SA, which require additional knowledge of the function, its gradient, and the optimum; in practice, these quantities must be approximated.

We consider a simple quadratic function, \(f(x) = -\frac{1} {3}x^{2}\), on the truncated intervals [−50, 50] and [−5, 95] with x 1 = 30.0, σ = 1.0, and 1,000 sample paths. For RM and RM with iterate averaging, we employ a common step size \(a_{n} =\theta _{a}/n\), where θ a  = 10.0. RM performed relatively well for a wide range of multiplicative constants; we chose θ a  = 10.0 even though preliminary numerical tests showed it did not yield the lowest MSE at the 1,000th or 10,000th iteration. For RSA, we adopt the constant step size that minimizes the finite-time bound in (6.8), where C = 100∕3 and 190∕3 for the intervals [−50, 50] and [−5, 95], respectively, and D Θ  = 100. For the AC-SA algorithm, we take α n  = 2∕(n + 1) and γ n  = 4γ∕[n(n + 1)], where γ is given in (6.19) with ν = 1, L = 2∕3, and M = 0.
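
As a point of reference for this setup, the sketch below (our own code, under the stated assumptions of Gaussian gradient noise and a plain average of all iterates as the averaging scheme) runs RM with and without Polyak–Ruppert iterate averaging on the maximization of \(f(x) = -\frac{1}{3}x^{2}\) and estimates the MSE of both outputs over independent sample paths.

    import numpy as np

    def rm_with_averaging(theta_a=10.0, x1=30.0, N=10_000, sigma=1.0,
                          lo=-50.0, hi=50.0, seed=0):
        """RM ascent on f(x) = -x**2/3 using noisy direct gradients, returning
        the final iterate x_N and the Polyak-Ruppert average of all iterates."""
        rng = np.random.default_rng(seed)
        x, total = x1, 0.0
        for n in range(1, N + 1):
            g = -2.0 * x / 3.0 + sigma * rng.standard_normal()  # noisy gradient
            x = min(max(x + (theta_a / n) * g, lo), hi)         # ascent + projection
            total += x
        return x, total / N

    # MSE of both outputs over 1,000 sample paths (the optimum is x* = 0)
    last, avg = zip(*(rm_with_averaging(seed=s) for s in range(1000)))
    print(np.mean(np.square(last)), np.mean(np.square(avg)))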

Figure 6.4 plots the MSE as a function of the number of iterations from 1 to 10,000 on a log scale. The results for the centered and skewed truncated intervals exhibit the same qualitative behavior across all four algorithms. RM performs well with a good parameter choice, although it is not the best, and averaging the iterates improves its performance, yielding a smoother, monotonically decreasing MSE curve as the number of iterations increases. Compared to a reasonably tuned RM with or without iterate averaging, RSA appears to be inferior, at least in this simple numerical experiment. The most interesting curve is that of the AC-SA algorithm, which exhibits periodic oscillations that decrease in magnitude as the number of iterations increases. We investigated this behavior further by analyzing individual sample paths, and the estimates \(\{x_{n}^{ag}\}\) exhibit the same behavior, following a smooth oscillating path. From Fig. 6.4, the AC-SA curve appears to level off and hover slightly above the RSA curve. Because of the oscillations, the stopping time dictates the relative performance of AC-SA for smaller numbers of iterations. For the skewed interval, there is a small range of iterations where AC-SA outperforms RSA, RM, and RM with iterate averaging, as well as other small ranges where it outperforms only RSA. Keep in mind that these experiments involve a simple quadratic function in one particular setting, so the relative performance could well differ in other settings.

Fig. 6.4 MSE under RM, RM with averaging, RSA, and AC-SA for \(f(x) = -\frac{1} {3}x^{2}\), x 1 = 30.0, σ = 1.0, for the symmetric [−50, 50] (left graph) and skewed [−5, 95] (right graph) intervals

From our numerical experiments, one can conclude that RM and RM with iterate averaging have the potential to outperform RSA and AC-SA if the step-size parameter is chosen appropriately, which holds for a wide range of choices. In this case, iterate averaging improves the performance of RM over all 10,000 iterations. Both the AC-SA and RSA algorithms require additional knowledge to choose the optimal step sizes that minimize the bounds in (6.15) and (6.8), respectively.

STAR-SA, RM, and KW

We implement STAR-SA, RM, and KW under various combinations of noise levels σ f and σ g for \(f(x) = -ax^{2}\), a > 0, and Θ = [−50, 50]. The gain sequence and finite difference step sizes are \(\theta _{a}/n\) and \(\theta _{c}/n^{1/4}\), respectively, and the MSE results are based on 1,000 sample paths. For a fairer comparison, the number of iterations for RM is doubled, since STAR-SA and KW require twice the number of sample path runs per iteration. We consider the following values for the parameter and initial settings: steepness level \(a \in \{ 10^{k}\mid k = -3,-2.5,\ldots,1.5,2\}\), \(x_{1} \in \{-50 + 5k\mid k = 1,\ldots,10\}\), θ a  ∈ { 1, 10, 100}, θ c  ∈ { 0.1, 1.0}, \(\sigma _{f} \in \{ 10^{k}\mid k = -3,\ldots,1\}\), \(\sigma _{g} \in \{ 10^{k}\mid k = -3,\ldots,1\}\), and N ∈ { 100, 1000, 10000}. Although STAR-SA, RM, and KW were implemented for all settings, only the case \(f(x) = -0.1x^{2}\), \(a_{n} = 10n^{-1}\), \(c_{n} = 0.1n^{-1/4}\), σ f  ∈ { 0.001, 0.1, 1.0}, σ g  ∈ { 0.001, 0.1, 1.0}, and N = 1000 will be described in detail.

Fig. 6.5 MSE of the 1,000th iterate under STAR-SA, RM, and KW for \(f(x) = -0.1x^{2}\), σ g  = 1.0, and two levels of σ f : 0.1 (left graph) and 1.0 (right graph)

The STAR-SA algorithm outperforms KW and RM for 6 out of the 9 noise-level combinations for all initial start values considered. For σ f  = 0.001 and \(\sigma _{f} <\sigma _{g}\), the MSE of STAR-SA is lower than that of RM but approximately equal to that of KW. The only case where the MSE of the STAR-SA algorithm is not approximately less than or equal to the MSE of KW and RM is when both noise levels are very low, i.e., \(\sigma _{f} =\sigma _{g} = 0.001\), which is not shown. In this case, RM performs better than STAR-SA except when the start value is close to the optimum \(x^{\ast} = 0\). In fact, the MSE of STAR-SA decreases as x 1 approaches \(x^{\ast}\). In addition, the MSE of KW is close to zero when σ f  = 0.001 and that of RM is close to zero when σ g  ∈ { 0.001, 0.1}, whereas the MSE of STAR-SA is close to zero for \((\sigma _{f},\sigma _{g}) \in \{ (0.001,0.001),(0.001,0.1),(0.001,1.0),\) (0.1, 0.001), (0.1, 0.1), (1.0, 0.001), (1.0, 0.1)}. Figure 6.5 illustrates the MSE results when σ f  = 1.0. When the function noise is high, KW performs poorly, RM outperforms KW, and STAR-SA has the lowest MSE. Figure 6.5 shows a case where the performance of KW and RM is similar, but the MSE of the STAR-SA algorithm is lower. Overall, from the numerical experiments conducted, STAR-SA either performs significantly better than both RM and KW in terms of MSE or achieves an MSE approximately equal to that of the algorithm with the lower MSE.

6.6 Concluding Remarks

Stochastic approximation has an enormous body of literature in all aspects of theory, algorithms, and applications. From its origins in statistics, it has now reached many disciplines in engineering and the social sciences, with well-known successes in such areas as signal processing, pattern recognition, and machine learning. Clearly, simulation optimization is another fertile area for its application.

This chapter introduced the two main versions of SA:

  • KW-like methods that rely only on function estimates, known as gradient-free or stochastic zeroth-order algorithms; and

  • RM-like methods that make use of direct estimates of first-order derivative information, known as stochastic gradient or stochastic first-order algorithms.

The latter methods generally perform better in practice, but they require information that is not always available. Asymptotically, they can obtain an O(n −1∕2) convergence rate, whereas the former are generally limited to an O(n −1∕3) convergence rate. Among the gradient-free methods, SPSA has been particularly successful for high-dimensional problems.
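
To illustrate why SPSA scales well in high dimensions, the sketch below (our own illustrative code; the noisy quadratic test function is an assumption) computes the standard simultaneous perturbation gradient estimate, which requires only two function evaluations per iteration regardless of the dimension d.

    import numpy as np

    def spsa_gradient(y, x, c, rng):
        """Standard simultaneous perturbation gradient estimate: two noisy
        evaluations of y yield an estimate of the full d-dimensional gradient."""
        delta = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher perturbations
        return (y(x + c * delta) - y(x - c * delta)) / (2.0 * c * delta)

    # Example: noisy quadratic in d = 10 dimensions
    rng = np.random.default_rng(0)
    y = lambda z: -np.sum(z**2) + 0.01 * rng.standard_normal()
    print(spsa_gradient(y, np.ones(10), 0.1, rng))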

The finite-time behavior of any SA algorithm depends heavily on the choice of the step-size or gain sequence, and various approaches to handling this challenge have been presented, from Kesten’s rule to iterate averaging, with the latter procedure highly recommended.

Classical notions of convergence in SA address the iterates {x n }, whereas recent finite-time analysis has turned to the properties of the function values {f(x n )}. Chapter 7 focuses on some recent SA algorithms mainly tailored to convex stochastic programming problems and provides such convergence properties.

Finally, SA methods are aimed at continuous-valued optimization problems, but there is some work attempting to apply SA to discrete optimization problems. A recent Ph.D. dissertation [44] addresses this setting and includes a summary of previous work in the area.