Abstract
In this paper we present a derivative-free optimization algorithm for finite minimax problems. The algorithm calculates an approximate gradient for each of the active functions of the finite max function and uses these to generate an approximate subdifferential. The negative projection of 0 onto this set is used as a descent direction in an Armijo-like line search. We also present a robust version of the algorithm, which uses the ‘almost active’ functions of the finite max function in the calculation of the approximate subdifferential. Convergence results are presented for both algorithms, showing that either \(f(x^k) \to -\infty\) or every cluster point is a Clarke stationary point. Theoretical and numerical results are presented for three specific approximate gradients: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithms, the three approximate gradients, and a regular and robust stopping condition.
1 Introduction
In this paper we consider the finite minimax problem:
$$\min_{x} \max\bigl\{f_i(x) : i=1, \ldots, N\bigr\}, $$
where each individual \(f_i\) is continuously differentiable. We further restrict ourselves to the field of derivative-free optimization (DFO), where we are only permitted to compute function values, i.e., we cannot compute gradient values \(\nabla f_i\) directly. We present a derivative-free algorithm that exploits the smooth substructure of the finite max problem, thereby creating a robust algorithm with an elegant convergence theory.
Finite minimax problems occur in numerous applications, such as portfolio optimization [8], control system design [21], engineering design [32], and determining the cosine measure of a positive spanning set [10, Def. 2.7]. In a finite max function, although each individual \(f_i\) may be smooth, taking the maximum forms a nonsmooth function with ‘nondifferentiable ridges’. For this reason, most algorithms designed to solve finite minimax problems employ some form of smoothing technique; see, for example, [31, 33, 34] and [39], among many others. In general, these smoothing techniques require gradient calculations.
However, in many situations gradient information is not available or can be difficult to compute accurately (see [4, 15, 20, 29] and [10, Chap. 1] for some examples of such situations). Such situations are considered by research in the area of derivative-free optimization. For a thorough introduction to several basic DFO frameworks and convergence results for each, see [10].
Research on optimizing finite max functions without calculating derivatives can be found as early as 1975 [28], and the area has recently seen a resurgence of interest [26] and [19].
In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential penalty function in a directional direct-search framework to form a derivative-free optimization method for finite minimax problems [26]. This method is shown to globally converge towards a standard stationary point of the original finite minimax problem.
Also specific to the finite minimax problem, a derivative-free method is presented in [19] that exploits the smooth substructure of the problem. It combines the frameworks of a directional direct search method [10, Chap. 7] and the gradient sampling algorithm (GS algorithm) presented in [6] and [7]. Loosely speaking, the GS algorithm uses a collection of local gradients to build a ‘robust subdifferential’ of the objective function and uses this to determine a ‘robust descent direction’. In [19], these ideas are used to develop several methods to find an approximate descent direction that moves close to parallel to an ‘active manifold’. During each iteration, points are sampled from around the current iterate and the simplex gradient is calculated for each of the active functions of the objective function. The calculated simplex gradients are then used to form an approximate subdifferential, which is then used to determine a likely descent direction.
Ideas from the GS algorithm have appeared in two other recent DFO methodologies [2] and [24].
In 2008, Bagirov, Karasözen and Sezer presented a discrete gradient derivative-free method for unconstrained nonsmooth optimization problems [2]. Described as a derivative-free version of the bundle method presented in [37], the method uses discrete gradients to approximate subgradients of the function and build an approximate subdifferential. The analysis of this method provides proof of convergence to a Clarke stationary point for an extensive class of nonsmooth problems. In this paper, we focus on the finite minimax problem. This allows us to require few (other) assumptions on our function while maintaining strong convergence analysis. It is worth noting that we use the same set of test problems as in [2]. Specifically, we use the [27] test set and exclude one problem as its sub-functions are complex-valued. (The numerics in [2] exclude the same problem, and several others, without explanation.)
Using approximate gradient calculations instead of gradient calculations, the GS algorithm is made derivative free by Kiwiel in [24]. Specifically, Kiwiel employs the Gupal estimate of the gradient of the Steklov averaged function (see [18] or Sect. 4.3 herein) as an approximate gradient. It is shown that, with probability 1, this derivative-free algorithm satisfies the same convergence results as the GS algorithm—it either drives the f-values to −∞ or each cluster point is found to be Clarke stationary [24, Theorem 3.8]. No numerical results are presented for Kiwiel’s derivative-free algorithm.
In this paper, we use the GS algorithm framework with approximate gradients to form a derivative-free approximate gradient sampling algorithm. As we are dealing with finite max functions, instead of calculating an approximate gradient at each of the sampled points, we calculate an approximate gradient for each of the active functions. Expanding the active set to include ‘almost’ active functions, we also present a robust version of our algorithm, which is more akin to the GS algorithm. In this robust version, when our iterate is close to a point of nondifferentiability, the size and shape of our approximate subdifferential will reflect the presence of ‘almost active’ functions. Hence, when we project 0 onto our approximate subdifferential, the descent direction will direct minimization parallel to a ‘nondifferentiable ridge’, rather than straight at this ridge. It can be seen in our numerical results that these robust changes greatly influence the performance of our algorithm.
Our algorithm differs from the above in a few key ways. Unlike in [26], we do not employ a smoothing technique. Unlike in [19], which uses the directional direct-search framework to imply convergence, we employ an approximate steepest descent framework. Using this framework, we are able to analyze convergence directly and develop stopping conditions for the algorithm. Unlike in [2] and [24], where convergence is proven for a specific approximate gradient, we prove convergence for any approximate gradient that satisfies a simple error bound dependent on the sampling radius. As examples, we present the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. (As a side-note, Sect. 4.3 also provides, to the best of the authors’ knowledge, novel error analysis of the Gupal estimate of the gradient of the Steklov averaged function.)
Focusing on the finite minimax problem provides us with an advantage over the methods of [2] and [24]. In particular, we only require order n function calls per iteration (where n is the dimension of the problem), while both [2] and [24] require order mn function calls per iteration (where m is the number of gradients they approximate to build their approximate subdifferential). (The original GS algorithm suggests that m≈2n provides a good value for m.)
The remainder of this paper is organized as follows. In Sect. 2, we present the approximate gradient sampling algorithm (AGS algorithm) and our convergence analysis. In Sect. 3, we present a robust version of the AGS algorithm (RAGS algorithm), which uses ‘almost active’ functions in the calculation of the approximate subdifferential. In Sect. 4, we show that the AGS and RAGS algorithms converge using three specific approximate gradients: simplex gradient, centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. Finally, in Sect. 5, we present our numerical results and analysis.
2 Approximate gradient sampling algorithm
Throughout this paper, we assume that our objective function is of the form
$$ f(x) = \max\bigl\{f_i(x) : i=1, \ldots, N\bigr\}, $$(1)
where each \(f_{i} \in\mathcal{C}^{1}\), but we cannot compute \(\nabla f_i\). We use \(\mathcal{C}^{1}\) to denote the class of differentiable functions whose gradient mapping \(\nabla f\) is continuous. We denote by \(\mathcal{C}^{1+}\) the class of continuously differentiable functions whose gradient mapping \(\nabla f\) is locally Lipschitz and we denote by \(\mathcal{C}^{2+}\) the class of twice continuously differentiable functions whose Hessian mapping \(\nabla^2 f\) is locally Lipschitz. Additionally, throughout this paper, |⋅| denotes the Euclidean norm and ∥⋅∥ denotes the corresponding matrix norm.
For the finite max function in (1), we define the active set of f at a point \(\bar{x}\) to be the set of indices
$$ A(\bar{x}) = \bigl\{ i : f_i(\bar{x}) = f(\bar{x}) \bigr\}. $$
The set of active gradients of f at \(\bar{x}\) is denoted by
$$ \bigl\{ \nabla f_i(\bar{x}) \bigr\}_{i \in A(\bar{x})}. $$
Let f be locally Lipschitz at a point \(\bar{x}\). As f is locally Lipschitz, Rademacher's theorem provides a dense, full-measure set D⊂ℝn on which f is differentiable. The Clarke subdifferential [9] is constructed via
$$ \partial f(\bar{x}) = \operatorname{conv}\Bigl\{ \lim_{j \to\infty} \nabla f\bigl(y^j\bigr) : y^j \to\bar{x},\ y^j \in D \Bigr\}. $$
For a finite max function, assuming \(f_{i} \in\mathcal{C}^{1}\) for each \(i \in A(\bar{x})\), the Clarke subdifferential (as proven in [9, Prop. 2.3.12]) is equivalent to
$$ \partial f(\bar{x}) = \operatorname{conv}\bigl\{ \nabla f_i(\bar{x}) \bigr\}_{i \in A(\bar{x})}. $$(2)
By (2), it is clear that for finite max functions the subdifferential is a compact set. This will be important in the convergence analysis in Sect. 2.2.
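As an illustration (ours, not from the paper), the active set A(x̄) of a small finite max function can be computed directly from function values; the tolerance parameter below is an implementation detail we introduce to absorb floating-point error:

```python
import numpy as np

def active_set(fs, x, tol=1e-12):
    """Return A(x): indices i whose f_i(x) attains max_j f_j(x),
    up to a small tolerance for floating-point error."""
    vals = np.array([fi(x) for fi in fs])
    return [i for i, v in enumerate(vals) if vals.max() - v <= tol]

# f(x) = max(x^2, (x - 1)^2) has a nondifferentiable ridge at x = 0.5,
# where both component functions are active.
fs = [lambda x: x**2, lambda x: (x - 1.0)**2]
ridge_active = active_set(fs, 0.5)    # both indices
smooth_active = active_set(fs, 2.0)   # only the first index
```

At the ridge both gradients enter the convex hull in (2); away from it the subdifferential collapses to a single gradient.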
We are now ready to state the general form of the AGS algorithm, an approximate subgradient descent method.
2.1 Algorithm—AGS
We first provide a partial glossary of notation used in the definition of the AGS algorithm (see Table 1).
Conceptual Algorithm: [Approximate Gradient Sampling Algorithm]

0. Initialize: Set k=0 and input
   - \(x^0\)—starting point
   - \(\mu^0>0\)—accuracy measure
   - \(\Delta^0>0\)—initial sampling radius
   - \(\theta\in(0,1)\)—sampling radius reduction factor
   - \(0<\eta<1\)—Armijo-like parameter
   - \(t_{\min}\)—minimum step length
   - \(\varepsilon_{tol}>0\)—stopping tolerance

1. Generate Approximate Subdifferential \(G^k\):
   Generate a set \(Y=[x^k, y^1, \ldots, y^m]\) around the current iterate \(x^k\) such that
   $$\max_{j=1, \ldots, m} \bigl|y^j - x^k\bigr| \le \Delta^k. $$
   Use Y to calculate the approximate gradient of \(f_i\), denoted \(\nabla_A f_i\), at \(x^k\) for each \(i\in A(x^k)\). Set
   $$G^k = \operatorname{conv}\bigl\{ \nabla_A f_i \bigl(x^k\bigr)\bigr\}_{i \in A(x^k)}. $$

2. Generate Search Direction:
   Let
   $$d^k = -\operatorname{Proj}\bigl(0|G^k\bigr). $$
   Check if
   $$ \Delta^k \le\mu^k \bigl|d^k\bigr|. $$(3)
   If (3) does not hold, then set \(x^{k+1}=x^k\),
   $$ \Delta^{k+1}= \begin{cases} \theta\mu^k |d^k| & \text{if $|d^{k}| \not= 0$}, \\ \theta\Delta^k & \text{if $|d^{k}| = 0$}, \end{cases} $$(4)
   \(k=k+1\) and return to Step 1. If (3) holds and \(|d^k|<\varepsilon_{tol}\), then STOP. Else, continue to the line search.

3. Line Search:
   Attempt to find \(t^k>0\) such that
   $$f\bigl(x^k +t^k d^k\bigr) < f \bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$
   Line Search Failure: Set \(\mu^{k+1} = \frac{\mu^{k}}{2}\), \(x^{k+1}=x^k\) and go to Step 4.
   Line Search Success: Let \(x^{k+1}\) be any point such that
   $$f\bigl(x^{k+1}\bigr) \le f\bigl(x^k +t^k d^k\bigr). $$

4. Update and Loop:
   Set \(\Delta^{k+1}=\max_{j=1, \ldots, m} |y^j-x^k|\), \(k=k+1\) and return to Step 1.
In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting point x 0, and initialize the parameter values.
In Step 1, we create the approximate subdifferential. First, we select a set of points around x k within a sampling radius of Δk. In implementation, the points are randomly and uniformly sampled from a ball of radius Δk (using the MATLAB randsphere.m function [36]). Using this set Y, we then calculate an approximate gradient for each of the active functions at x k and set the approximate subdifferential G k equal to the convex hull of these active approximate gradients, ∇ A f i (x k). Details on various approximate gradients appear in Sect. 4.
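In Python, a minimal stand-in for randsphere.m (our sketch, assuming NumPy) draws isotropic directions and rescales by \(U^{1/n}\) so that the points are uniform in the ball:

```python
import numpy as np

def sample_ball(center, radius, m, rng=None):
    """Sample m points uniformly from the closed ball B(center, radius)."""
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    n = center.size
    dirs = rng.standard_normal((m, n))                 # isotropic directions
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = radius * rng.random(m) ** (1.0 / n)        # u^(1/n) for uniformity
    return center + radii[:, None] * dirs
```

The set Y of Step 1 is then the current iterate together with these m sampled points.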
In Step 2, we generate a search direction by solving the projection of 0 onto the approximate subdifferential: \(\operatorname{Proj}(0|G^{k}) \in\operatorname{arg\,min} _{g \in G^{k}} |g|^{2}\). The search direction d k is set equal to the negative of the solution, i.e., \(d^{k} = -\operatorname{Proj}(0|G^{k})\).
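The projection itself is a small convex quadratic program over the unit simplex of convex-combination weights. The following sketch is our illustration (it assumes SciPy's SLSQP solver, which is not the solver used in the paper):

```python
import numpy as np
from scipy.optimize import minimize

def proj_zero_onto_hull(grads):
    """Proj(0|conv{rows of grads}): minimize |sum_i a_i g_i|^2 over the
    unit simplex {a : a_i >= 0, sum_i a_i = 1}, returning the point."""
    grads = np.asarray(grads, dtype=float)
    m = grads.shape[0]
    obj = lambda a: float(np.dot(a @ grads, a @ grads))
    cons = ({"type": "eq", "fun": lambda a: a.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m, constraints=cons)
    return res.x @ grads

g = proj_zero_onto_hull([[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
d = -g   # the search direction d^k
```

For this hull the nearest point to the origin lies on the segment between (2, 0) and (0, 2), namely (1, 1), so d points toward the origin side of the hull.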
After finding a search direction, we check the inequality Δk≤μ k|d k|. This inequality determines if the current sampling radius is sufficiently small relative to the distance from 0 to the approximate subdifferential. If this inequality holds and |d k|<ε tol , then we terminate the algorithm, as 0 is within ε tol of the approximate subdifferential and the sampling radius is small enough to reason that the approximate subdifferential is accurate. If the above inequality does not hold, then the approximate subdifferential is not sufficiently accurate to warrant a line search, so we decrease the sampling radius, set x k+1=x k, update the iterate counter and loop (Step 1). If the above inequality holds, but |d k|≥ε tol , then we proceed to a line search.
In Step 3, we carry out a line search. We attempt to find a step length \(t^k>0\) such that the Armijo-like condition holds
$$ f\bigl(x^k + t^k d^k\bigr) < f\bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$(5)
This condition ensures sufficient decrease is found in the function value. In implementation, we use a back-tracking line search (described in [30]) with an initial step-length of t ini=1, terminating when the step length t k is less than a threshold t min. If we find a t k such that (5) holds, then we declare a line search success. If not, then we declare a line search failure.
If a line search success occurs, then we let \(x^{k+1}\) be any point such that
$$ f\bigl(x^{k+1}\bigr) \le f\bigl(x^k + t^k d^k\bigr). $$(6)
In implementation, we do this by searching through the function values used in the calculation of our approximate gradients (\(\{f(y^{i})\}_{y^{i} \in Y}\)). As this set of function values corresponds to points distributed around our current iterate, there is a good possibility of finding further function value decrease without having to carry out additional function evaluations. We find the minimum function value in our set of evaluations and if (6) holds for this minimum value, then we set x k+1 equal to the corresponding input point. Otherwise, we set x k+1=x k+t k d k.
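The step-halving search described above can be sketched as follows (our illustration; the smooth test function and parameter values are hypothetical):

```python
import numpy as np

def armijo_backtrack(f, x, d, eta=0.1, t_min=1e-8, t_init=1.0):
    """Halve t from t_init until f(x + t*d) < f(x) - eta*t*|d|^2 holds;
    return t on success and None on line search failure."""
    fx = f(x)
    dd = float(np.dot(d, d))
    t = t_init
    while t >= t_min:
        if f(x + t * d) < fx - eta * t * dd:
            return t
        t *= 0.5
    return None

# Smooth stand-in objective f(x) = |x|^2 with a steepest-descent direction.
f = lambda x: float(np.dot(x, x))
x0 = np.array([1.0, 0.0])
d0 = np.array([-2.0, 0.0])
t = armijo_backtrack(f, x0, d0)
```

Here the full step t=1 overshoots the minimizer and fails the sufficient-decrease test, while t=0.5 lands exactly on it and succeeds.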
If a line search failure occurs, then we reduce the accuracy measure μ k by a factor of \(\frac{1}{2}\) and set x k+1=x k.
Finally, in Step 4, we update the iterate counter and the sampling radius, and then loop to Step 1 to resample.
2.2 Convergence
For the following results, we denote the approximate subdifferential of f at \(\bar{x}\) as
$$ G(\bar{x}) = \operatorname{conv}\bigl\{ \nabla_A f_i(\bar{x}) \bigr\}_{i \in A(\bar{x})}, $$
where \(\nabla_{A} f_{i} (\bar{x})\) is the approximate gradient of \(f_i\) at \(\bar{x}\). Our first result establishes an error bound relation between the elements of the approximate subdifferential and the exact subdifferential.
Lemma 2.1
Let f=max{\(f_i\) : i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Suppose there exists an ε>0 such that \(| \nabla_{A} f_{i} (\bar{x}) - \nabla f_{i}(\bar{x})| \le\varepsilon\) for all i=1,…,N. Then

1. for all \(w \in G(\bar{x})\), there exists a \(v \in\partial f(\bar{x})\) such that |w−v|≤ε, and
2. for all \(v \in\partial f(\bar{x})\), there exists a \(w \in G(\bar{x})\) such that |w−v|≤ε.
Proof
1. By definition, for all \(w \in G(\bar{x})\) there exists a set of \(\alpha_i\) such that
$$ w = \sum_{i \in A(\bar{x})} \alpha_i \nabla_A f_i(\bar{x}), \quad\text{where } \alpha_i \ge0 \text{ and } \sum_{i \in A(\bar{x})} \alpha_i = 1. $$
By our assumption that each \(f_{i} \in\mathcal{C}^{1}\), we have \(\partial f(\bar{x})= \operatorname{conv}\{\nabla f_{i}(\bar{x})\}_{i \in A(\bar{x})}\). Using the same \(\alpha_i\) as above, we see that
$$ v = \sum_{i \in A(\bar{x})} \alpha_i \nabla f_i(\bar{x}) \in\partial f(\bar{x}). $$
Then
$$ |w - v| = \biggl| \sum_{i \in A(\bar{x})} \alpha_i \bigl( \nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x}) \bigr) \biggr| \le\sum_{i \in A(\bar{x})} \alpha_i \bigl| \nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x}) \bigr| \le\varepsilon. $$
Hence, for all \(w \in G(\bar{x})\), there exists a \(v \in\partial f(\bar{x})\) such that
$$ |w - v| \le\varepsilon. $$
2. Analogous arguments can be applied to \(v \in\partial f(\bar{x})\). □
Lemma 2.1 quantifies the quality of the approximate subdifferential as an approximation of the exact subdifferential whenever the approximate gradients of the component functions are good approximations of the true gradients. Our next goal (in Theorem 2.4) is to show that eventually a line search success will occur in the AGS algorithm. To achieve this we make use of the following lemma.
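The proof's construction, pairing each convex combination of approximate gradients with the same combination of true gradients, can be verified numerically on synthetic data; the sketch below is ours and assumes NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, eps = 3, 4, 1e-2

true_grads = rng.standard_normal((N, n))
# Perturb each gradient by a vector of norm exactly eps.
noise = rng.standard_normal((N, n))
noise *= eps / np.linalg.norm(noise, axis=1, keepdims=True)
approx_grads = true_grads + noise

# Pair w = sum a_i * (approx gradient i) with v = sum a_i * (true gradient i)
# for the same convex weights a, as in the proof of Lemma 2.1(1).
a = rng.random(N)
a /= a.sum()
w, v = a @ approx_grads, a @ true_grads
gap = float(np.linalg.norm(w - v))   # at most eps, by the triangle inequality
```

The gap between the paired elements never exceeds the per-gradient error ε, matching the lemma's bound.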
Lemma 2.2
Let f=max{\(f_i\) : i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Suppose there exists an ε>0 such that \(| \nabla_{A} f_{i} (\bar{x})- \nabla f_{i}(\bar{x})| \le\varepsilon\) for all i=1,…,N. Define \(d=-\operatorname{Proj} (0|G(\bar{x}))\) and suppose \(|d| \not= 0\). Let β∈[0,1). If ε<(1−β)|d|, then for all \(v \in\partial f(\bar{x})\) we have
$$ \langle d, v \rangle< -\beta|d|^2. $$
Proof
Notice that, by the Projection Theorem [3, Theorem 3.14], \(d=-\operatorname{Proj}(0|G(\bar{x}))\) implies that
$$ \bigl\langle0 - (-d),\, w - (-d) \bigr\rangle\le0 \quad\text{for all } w \in G(\bar{x}). $$
Hence,
$$ \langle d, w \rangle\le-|d|^2 \quad\text{for all } w \in G(\bar{x}). $$
So we have, for all \(v \in\partial f(\bar{x})\), a \(w \in G(\bar{x})\) with \(|w - v| \le\varepsilon\) (Lemma 2.1(2)). For any \(v \in\partial f(\bar{x})\), using w as constructed in Lemma 2.1, we see that
$$ \langle d, v \rangle= \langle d, w \rangle+ \langle d, v - w \rangle \le-|d|^2 + |d|\varepsilon< -|d|^2 + (1-\beta)|d|^2 = -\beta|d|^2. $$
□
Remark 2.3
In Lemma 2.2, for the case when β=0, the condition ε<(1−β)|d| simplifies to ε<|d|. Thus, if ε is bounded above by |d|, then Lemma 2.2 proves that for all \(v \in\partial f(\bar{x})\) we have 〈d,v〉<0, showing that d is a descent direction for f at \(\bar{x}\).
To guarantee convergence, we must show that, except in the case of 0∈∂f(x k), the algorithm will always be able to find a sampling radius that satisfies the requirements in Step 2. In Sect. 4 we show that (for three different approximate gradients) the value ε (in Lemma 2.2) is linked to Δ. As unsuccessful line searches will drive Δ to zero, this implies that eventually the requirements of Lemma 2.2 will be satisfied. We formalize this in the next two theorems.
Theorem 2.4
Let f=max{\(f_i\) : i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Suppose \(0 \not\in\partial f(x^{k})\) for each iteration k. Suppose there exists \(\bar{K} >0 \) such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies \(|\nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le\bar{K} \Delta^{k}\) for all i=1,…,N. Let \(d^{k} = - \operatorname{Proj}(0|G(x^{k}))\). Then for any μ>0, there exists \(\bar{\Delta}=\bar{\Delta}(x^{k})>0\) such that,
$$ \Delta\le\mu\bigl|d^k\bigr| + \bar{K}\mu\bigl(\Delta^k - \Delta\bigr) \quad\text{for all } 0 < \Delta< \bar{\Delta}. $$
Moreover, if \(\Delta^{k} < \bar{\Delta}\), then the following inequality holds
$$ \Delta^k \le\mu\bigl|d^k\bigr|. $$
Proof
Let \(\bar{v} = \operatorname{Proj}(0|\partial f(x^{k}))\) (by assumption, \(\bar{v} \neq0\)).

Given μ>0, let
$$ \bar{\Delta} = \frac{\mu|\bar{v}|}{1 + \bar{K}\mu} $$(9)
and consider \(0 < \Delta< \bar{\Delta}\). Now create G(x k) and \(d^{k} = - \operatorname{Proj}(0|G(x^{k}))\). As −d k∈G(x k), by Lemma 2.1(1), there exists a v k∈∂f(x k) such that
$$ \bigl|{-d^k} - v^k\bigr| \le\bar{K}\Delta^k. $$
Then
$$ |\bar{v}| \le\bigl|v^k\bigr| \le\bigl|d^k\bigr| + \bigl|{-d^k} - v^k\bigr| \le\bigl|d^k\bigr| + \bar{K}\Delta^k. $$
Thus, for \(0 < \Delta< \bar{\Delta}\), we apply (9) to \(|\bar{v}|\) in the above inequality to get
$$ \frac{(1 + \bar{K}\mu)\Delta}{\mu} < |\bar{v}| \le\bigl|d^k\bigr| + \bar{K}\Delta^k, $$
which rearranges to
$$ \Delta\le\mu\bigl|d^k\bigr| + \bar{K}\mu\bigl(\Delta^k - \Delta\bigr). $$
Hence, \(\Delta\le\mu|d^{k}| + \bar{K} \mu(\Delta^{k} - \Delta)\) for all \(0 < \Delta< \bar{\Delta}\). Finally, if \(\Delta^{k} < \bar{\Delta}\), then
$$ \Delta^k \le\mu\bigl|d^k\bigr| + \bar{K}\mu\bigl(\Delta^k - \Delta^k\bigr) = \mu\bigl|d^k\bigr|. $$
□
Remark 2.5
In Theorem 2.4, it is important to note that eventually the condition \(\Delta^{k} < \bar{\Delta}\) will hold. Examine \(\bar {\Delta}\) as constructed above: \(\bar{K}\) is a constant and \(\bar {v}\) is associated with the current iterate. However, the current iterate is only updated when a line search success occurs, which will not occur unless the condition Δk≤μ k|d k| is satisfied. As a result, if \(\Delta^{k} \ge\bar{\Delta}\), the AGS algorithm will reduce Δk, with \(\bar{\Delta}\) remaining constant, until \(\Delta^{k} <\bar{\Delta}\).
Recall in Step 3 of the AGS algorithm, for a given η∈(0,1), we attempt to find a step length \(t^k>0\) such that
$$ f\bigl(x^k + t^k d^k\bigr) < f\bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$
The following result shows that eventually the above inequality will hold in the AGS algorithm. Recall that the exact subdifferential for a finite max function, as defined in (2), is a compact set. Thus, we know that in the following theorem \(\tilde{v}\) is well-defined.
Theorem 2.6
Fix 0<η<1. Let f=max{\(f_i\) : i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Suppose there exists an ε>0 such that \(| \nabla_{A} f_{i} (\bar{x})- \nabla f_{i}(\bar{x})| \le \varepsilon\) for all i=1,…,N. Define \(d=-\operatorname{Proj}(0|G(\bar{x}))\) and suppose \(|d| \not= 0\). Let \(\tilde {v} \in\operatorname{arg\,max}\{ \langle d,v \rangle: v \in\partial f(\bar{x})\}\). Let \(\beta= \frac{2\eta}{1+\eta}\). If ε<(1−β)|d|, then there exists \(\bar{t} >0\) such that
$$ f(\bar{x} + t d) < f(\bar{x}) - \eta t |d|^2 \quad\text{for all } 0 < t < \bar{t}. $$
Proof
Note that β∈(0,1). Recall, from Lemma 2.2, we have for all \(v \in\partial f(\bar{x})\)
$$ \langle d, v \rangle< -\beta|d|^2. $$(10)
Using \(\beta= \frac{2\eta}{1+\eta}\), (10) becomes
$$ \langle d, v \rangle< -\frac{2\eta}{1+\eta}|d|^2. $$(11)
From (11) we can conclude that for all \(v \in \partial f(\bar{x})\)
$$ \langle d, v \rangle+ \eta|d|^2 < -\frac{\eta(1-\eta)}{1+\eta}|d|^2 < 0. $$
Notice that
$$ \lim_{t \downarrow0} \frac{f(\bar{x}+td)-f(\bar{x})}{t} = \max\bigl\{ \langle d, v \rangle: v \in\partial f(\bar{x})\bigr\} = \langle d, \tilde{v} \rangle. $$
Therefore, there exists \(\bar{t} > 0\) such that
$$ \frac{f(\bar{x}+td)-f(\bar{x})}{t} < \langle d, \tilde{v} \rangle+ \frac{\eta(1-\eta)}{1+\eta}|d|^2 \quad\text{for all } 0 < t < \bar{t}. $$
For such a t, we have
$$ \frac{f(\bar{x}+td)-f(\bar{x})}{t} < -\frac{2\eta}{1+\eta}|d|^2 + \frac{\eta(1-\eta)}{1+\eta}|d|^2 = -\eta|d|^2. $$
Hence,
$$ f(\bar{x}+td) < f(\bar{x}) - \eta t|d|^2 \quad\text{for all } 0 < t < \bar{t}. $$
□
Combining the previous results, we can show that the AGS algorithm is guaranteed to find function value decrease (provided 0∉∂f(x k)). We summarize with the following corollary.
Corollary 2.7
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose 0∉∂f(x k) for each iteration k. Suppose there exists a \(\bar{K} >0\) such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies \(| \nabla_{A} f_{i} (x^{k}) - \nabla f_{i} (x^{k})| \le\bar{K} \Delta^{k}\) for all i=1,…,N. Then after a finite number of iterations, the algorithm will find a new iterate with a lower function value.
Proof
Consider x k, where 0∉∂f(x k).
To find function value decrease with the AGS algorithm, we must declare a line search success in Step 3. The AGS algorithm will only carry out a line search if the condition below is satisfied
$$ \Delta^k \le\mu^k \bigl|d^k\bigr|, $$(12)
where \(d^{k} = -\operatorname{Proj}(0|G(x^{k}))\), as usual. In Theorem 2.4, we showed that for any μ k>0, there exists a \(\bar {\Delta}=\bar{\Delta}(x^{k})>0\) such that if \(\Delta^{k} < \bar{\Delta }(x^{k})\), then (12) is satisfied. If (12) is not satisfied, then Δk is updated according to (4) and x k+1=x k, which further implies \(\bar{\Delta} = \bar{\Delta}(x^{k+1})=\bar{\Delta}(x^{k})\) is unchanged. In this case, whether \(|d^{k}| \not= 0\) or |d k|=0, we can see that Δk+1≤θΔk. Hence an infinite sequence of (12) being unsatisfied is impossible (as eventually we would have \(\Delta^{k} < \bar{\Delta}\)). So eventually (12) will be satisfied and the AGS algorithm will carry out a line search.
Now, in order to have a line search success, we must be able to find a step length \(t^k\) such that the Armijo-like condition holds,
$$ f\bigl(x^k + t^k d^k\bigr) < f\bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$
In Theorem 2.6, we showed that there exists \(\bar{t} >0\) such that
$$ f\bigl(x^k + t d^k\bigr) < f\bigl(x^k\bigr) - \eta t \bigl|d^k\bigr|^2 \quad\text{for all } 0 < t < \bar{t}, $$
provided that for β∈(0,1),
$$ \varepsilon< (1-\beta)\bigl|d^k\bigr|. $$(13)
Set \(\varepsilon= \bar{K} \Delta^{k}\). If (13) does not hold, then a line search failure will occur, resulting in μ k+1=0.5μ k. Thus, eventually we will have \(\mu^{k} < \frac {(1-\beta)}{\bar{K}}\) and, since a line search is only attempted when (12) holds,
$$ \varepsilon= \bar{K}\Delta^k \le\bar{K}\mu^k \bigl|d^k\bigr| < (1-\beta)\bigl|d^k\bigr|, $$
which means (13) will hold. Thus, after a finite number of iterations, the AGS algorithm will declare a line search success and find a new iterate with a lower function value. □
We are now ready to prove convergence. In particular, we study the limiting case of the algorithm generating an infinite sequence (i.e., the situation with ε tol =0). In the following, assuming that the step length t k is bounded away from 0 means that there exists a \(\bar{t} > 0\) such that \(t^{k}>\bar{t}\).
Theorem 2.8
Let f=max{\(f_i\) : i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Set \(\varepsilon_{tol}=0\) and suppose that \(\{x^{k}\}^{\infty}_{k=0}\) is an infinite sequence generated by the AGS algorithm. Suppose there exists a \(\bar{K} >0 \) such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies the error bound \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le \bar{K} \Delta^{k}\) for all i=1,…,N. Suppose \(t^k\) is bounded away from 0. Then either

1. \(f(x^{k})\downarrow-\infty\), or
2. \(|d^{k}|\to0\), \(\Delta^{k}\downarrow0\) and every cluster point \(\bar{x}\) of the sequence \(\{ x^{k}\}^{\infty}_{k=0}\) satisfies \(0 \in \partial f(\bar{x})\).
Proof
If f(x k)↓−∞, then we are done.
Otherwise, f(x k) is bounded below. Since the algorithm only accepts points of lower function value, the sequence f(x k) is non-increasing and bounded below, and therefore converges. We consider two cases.
Case 1: An infinite number of line search successes occur.
Let \(\bar{x}\) be a cluster point of \(\{x^{k}\}^{\infty}_{k=0}\). Notice that x k only changes for line search successes, so there exists a subsequence \(\{x^{k_{j}}\}^{\infty}_{j=0}\) of line search successes such that \(x^{k_{j}} \to\bar{x}\). Then for each corresponding step length \(t^{k_{j}}\) and direction \(d^{k_{j}}\), the following condition holds
$$ f\bigl(x^{k_j} + t^{k_j} d^{k_j}\bigr) < f\bigl(x^{k_j}\bigr) - \eta t^{k_j} \bigl|d^{k_j}\bigr|^2. $$
Note that
$$ f\bigl(x^{k_j}\bigr) - f\bigl(x^{k_j+1}\bigr) \ge f\bigl(x^{k_j}\bigr) - f\bigl(x^{k_j} + t^{k_j} d^{k_j}\bigr) > \eta t^{k_j} \bigl|d^{k_j}\bigr|^2 \ge0. $$
Since f(x k) converges we know that \(f(x^{k_{j}}) - f(x^{k_{j}+1}) \rightarrow0\). Since \(t^{k_{j}}\) is bounded away from 0, we see that
$$ \bigl|d^{k_j}\bigr| \to0. $$
Recall from the AGS algorithm, we check the condition
$$ \Delta^{k_j} \le\mu^{k_j} \bigl|d^{k_j}\bigr|. $$
As \(\Delta^{k_{j}} >0\), \(\mu^{k_{j}} \le\mu^{0}\), and \(|d^{k_{j}}| \to0\), we conclude that \(\Delta^{k_{j}} \downarrow0\).

Finally, from Lemma 2.1(1), as \(-d^{k_{j}} \in G(x^{k_{j}})\), there exists a \(v^{k_{j}} \in\partial f(x^{k_{j}})\) such that
$$ \bigl|{-d^{k_j}} - v^{k_j}\bigr| \le\bar{K}\Delta^{k_j}, $$
which implies that
$$ \bigl|v^{k_j}\bigr| \le\bigl|d^{k_j}\bigr| + \bar{K}\Delta^{k_j}. $$
So,
$$ \bigl|v^{k_j}\bigr| \le\bigl|d^{k_j}\bigr| + \bar{K}\Delta^{k_j} \to0, $$
where \(|v^{k_{j}}| \ge\operatorname{dist}(0|\partial f(x^{k_{j}}))\geq 0\), which implies \(\operatorname{dist}(0|\partial f(x^{k_{j}})) \to0\). We have \(x^{k_{j}} \to \bar{x}\). As f is a finite max function, ∂f is outer semicontinuous (see [35, Definition 5.4 & Proposition 8.7]). Hence, every cluster point \(\bar{x}\) of a convergent subsequence of \(\{x^{k}\}^{\infty}_{k=0}\) satisfies \(0 \in \partial f(\bar{x})\).
Case 2: A finite number of line search successes occur.
This means there exists a \(\bar{k}\) such that \(x^{k} = x^{\bar{k}} =\bar{x}\) for all \(k \ge\bar{k}\). However, by Corollary 2.7, if \(0 \notin\partial f(\bar{x})\), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have \(0 \in\partial f(\bar{x})\).
To see Δk↓0 and |d k|→0, note that by Lemma 2.1(1) and \(0 \in\partial f(\bar{x})\), we have that for all \(k > \bar{k}\) there exists d∈G(x k) such that \(|d - 0| \leq\bar{K} \Delta^{k}\). In particular, \(|d^{k}| = |\operatorname{Proj}(0|G(x^{k}))| \leq \bar{K}\Delta^{k} \leq\bar{K}\Delta^{0}\) for all \(k > \bar{k}\). Now note that one of two situations must occur: either (3) is unsatisfied an infinite number of times or after a finite number of steps (3) is always satisfied and a line search failure occurs in Step 4. In the first case, we directly have Δk↓0 (by Step 3). In the second case, we have μ k↓0 (by Step 4), so \(\Delta^{k} \leq\mu^{k}|d^{k}| \leq\mu^{k}\bar{K}\Delta^{0}\) (by (3)) implies Δk↓0. Finally, \(|d^{k}| \leq\bar{K}\Delta^{k}\) completes the proof. □
Our last result shows that if the algorithm terminates in Step 2, then the distance from 0 to the exact subdifferential is controlled by ε tol .
Theorem 2.9
Let f=max{\(f_i\) : i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Suppose there exists a \(\bar{K} >0\) such that for each iteration k, the approximate gradient satisfies \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le\bar{K} \Delta^{k}\) for all i=1,…,N. Suppose the AGS algorithm terminates at some iteration \(\bar{k}\) in Step 2 for \(\varepsilon_{tol} >0\). Then
$$ \operatorname{dist}\bigl(0 | \partial f\bigl(x^{\bar{k}}\bigr)\bigr) \le\bigl(1 + \bar{K}\mu^0\bigr)\varepsilon_{tol}. $$
Proof
Let \(\bar{w} = \operatorname{Proj}(0|G(x^{\bar{k}}))\). We use \(\bar{v} \in \partial f(x^{\bar{k}})\) as constructed in Lemma 2.1(1) to see that
$$ \operatorname{dist}\bigl(0 | \partial f\bigl(x^{\bar{k}}\bigr)\bigr) \le|\bar{v}| \le|\bar{v} - \bar{w}| + |\bar{w}| \le\bar{K}\Delta^{\bar{k}} + \bigl|d^{\bar{k}}\bigr| < \bar{K}\Delta^{\bar{k}} + \varepsilon_{tol}. $$
The final statement now follows by the test Δk≤μ k|d k|≤μ 0 ε tol in Step 2. □
3 Robust approximate gradient sampling algorithm
The AGS algorithm depends on the active set of functions at each iterate, A(x k). Of course, it is possible at various times in the algorithm for there to be functions that are inactive at the current iterate, but active within a small radius of the current iterate. Typically such behaviour means that the current iterate is close to a ‘nondifferentiable ridge’ formed by the function. In [6] and [7], it is suggested that allowing an algorithm to take into account these ‘almost active’ functions will provide a better idea of what is happening at and around the current iterate, thus, making the algorithm more robust.
In this section we present the robust approximate gradient sampling algorithm (RAGS algorithm). Specifically, we adapt the AGS algorithm by expanding our active set to include all functions that are active at any of the points in the set Y. Recall from the AGS algorithm that the set Y is sampled from within a ball of radius Δk, so the points in Y are not far from the current iterate. We define the robust active set next.
Definition 3.1
Let f=max{\(f_i\) : i=1,…,N} where \(f_{i} \in\mathcal{C}^{1}\). Let \(y^0=x^k\) be the current iterate and \(Y=[y^0, y^1, y^2, \ldots, y^m]\) be a set of randomly sampled points from a ball centered at \(y^0\) with radius \(\Delta^k\). The robust active set of f on Y is
$$ A(Y) = \bigcup_{j=0}^{m} A\bigl(y^j\bigr). $$
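A direct sketch (ours, with hypothetical component functions) of the robust active set as the union of the pointwise active sets over Y:

```python
import numpy as np

def robust_active_set(fs, Y, tol=1e-12):
    """A(Y): indices of functions active at at least one point y in Y."""
    active = set()
    for y in Y:
        vals = np.array([fi(y) for fi in fs])
        active.update(i for i, v in enumerate(vals) if vals.max() - v <= tol)
    return sorted(active)

# f(x) = max(x, -x) = |x|: sampling across the ridge at x = 0 pulls both
# component functions into the robust active set.
fs = [lambda x: x, lambda x: -x]
both = robust_active_set(fs, [0.1, 0.06, -0.04])   # samples straddle the ridge
one = robust_active_set(fs, [0.1])                 # samples on one side only
```

When the samples straddle the ridge, both pieces enter A(Y) even though only one is active at the iterate itself; this is exactly the robustness the RAGS algorithm exploits.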
3.1 Algorithm—RAGS
Using the idea of the robust active set, we alter the AGS algorithm to accommodate the robust active set by replacing Steps 1 and 2 with the following.
1. Generate Approximate Subdifferential \(G^{k}_{Y}\) (Robust):
   Generate a set \(Y=[x^k, y^1, \ldots, y^m]\) around the current iterate \(x^k\) such that
   $$\max_{j=1, \ldots, m} \bigl|y^j - x^k\bigr| \le \Delta^k. $$
   Use Y to calculate the approximate gradient of \(f_i\), denoted \(\nabla_A f_i\), at \(x^k\) for each \(i\in A(Y)\). Then set \(G^{k} = \operatorname{conv}\{ \nabla_{A} f_{i} (x^{k})\}_{i \in A(x^{k})}\) and \(G^{k}_{Y} = \operatorname{conv}\{ \nabla_{A} f_{i} (x^{k})\}_{i \in A(Y)}\).

2. Generate Search Direction:
   Let
   $$d^k = -\operatorname{Proj}\bigl(0|G^k\bigr) \quad\text{and}\quad d^k_Y = -\operatorname{Proj}\bigl(0|G^k_Y \bigr). $$
   Check if
   $$ \Delta^k \le\mu^k \bigl|d^k\bigr|. $$(14)
   If (14) does not hold, then set \(x^{k+1}=x^k\),
   $$ \Delta^{k+1}= \begin{cases} \theta\mu^k |d^k| & \text{if $|d^{k}| \not= 0$}, \\ \theta\Delta^k & \text{if $|d^{k}| = 0$}, \end{cases} $$
   \(k=k+1\) and return to Step 1. If (14) holds and \(|d^k| < \varepsilon_{tol}\), then STOP. Else, continue to the line search, using \(d^{k}_{Y}\) as a search direction.
Notice that in Step 2 we still use the stopping conditions from Sect. 2. Although this modification requires the calculation of two projections, neither projection is particularly difficult to calculate. In Sect. 3.3, we use the Goldstein approximate subdifferential to adapt Theorem 2.9 to work for stopping conditions based on \(d^{k}_{Y}\), but we still do not have theoretical results for the exact subdifferential. It is important to note that no additional function evaluations are required for this modification.
In the numerics section, we test each version of our algorithm using the robust descent direction to check the stopping conditions. The results convincingly show that the robust stopping conditions not only guarantee convergence, but also significantly decrease the number of function evaluations required for the algorithm to converge.
3.2 Convergence
To show that the RAGS algorithm is well-defined we will require the fact that when Δk is small enough, the robust active set is in fact equal to the original active set.
Lemma 3.2
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Let \(Y = [\bar{x}, y^{1}, \ldots,\allowbreak y^{m} ]\) be a randomly sampled set from a ball centered at \(\bar{x}\) with radius Δ. Then there exists an \(\tilde{\varepsilon}>0\) such that if \(Y \subseteq B_{\tilde{\varepsilon}}(\bar{x})\), then \(A(\bar{x})=A(Y)\).
Proof
Clearly, if \(i \in A(\bar{x})\), then i∈A(Y) as \(\bar{x} \in Y\).
Consider \(i \not\in A(\bar{x})\). Then by the definition of f, we have that
$$ f_i(\bar{x}) < f(\bar{x}). $$
As f and \(f_i\) are continuous, there exists an \(\tilde {\varepsilon}_{i} >0\) such that for all \(z \in B_{\tilde{\varepsilon }_{i}} (\bar{x})\),
$$ f_i(z) < f(z). $$
If \(\Delta< \tilde{\varepsilon}_{i}\), then we have \(|y^{j} - \bar{x}| < \tilde{\varepsilon}_{i}\) for all j=1,…,m. Therefore,
$$ f_i\bigl(y^j\bigr) < f\bigl(y^j\bigr) \quad\text{for all } j=1, \ldots, m, $$
so \(i \not\in A(Y)\). Setting \(\tilde{\varepsilon} = \min_{i} \tilde{\varepsilon}_{i}\) completes the proof. □
Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-defined when using the robust active set.
Corollary 3.3
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose \(0 \not\in\partial f(x^{k})\) for each iteration k. Suppose there exists a \(\bar{K} >0\) such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k}) | \le\bar{K} \Delta ^{k}\) for all i=1,…,N. Then after a finite number of iterations, the RAGS algorithm will find function value decrease.
Proof
Consider x k, where 0∉∂f(x k).
For eventual contradiction, suppose we do not find function value decrease. In the RAGS algorithm, this corresponds to an infinite number of line search failures. If we have an infinite number of line search failures, then Δk→0, as |d k| is bounded, and \(x^{\bar{k}} = x^{k}\) for all \(\bar{k} \ge k\). In Lemma 3.2, \(\tilde{\varepsilon}\) depends only on x k. Hence, we can conclude that eventually \(\Delta^{k} \le\tilde{\varepsilon}\) and therefore \(Y \subseteq B_{\tilde{\varepsilon}} (x^{k})\). Thus, eventually A(x k)=A(Y k). Once the two active sets are equal, the results of Sect. 2 will hold. □
To examine convergence of the RAGS algorithm we use the result that eventually the robust active set at the current iterate will be a subset of the regular active set at any cluster point of the algorithm.
Lemma 3.4
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Let Y k=[x k,y 1,…,y m] be a randomly sampled set from a ball centered at x k with radius Δk. Let \(x^{k} \to \bar{x}\). Then there exists an \(\tilde{\varepsilon} > 0\) such that if \(Y^{k} \subseteq B_{\tilde{\varepsilon}}(\bar{x})\), then \(A(Y^{k}) \subseteq A(\bar{x})\).
Proof
Let \(i \notin A(\bar{x})\). We must show that for k sufficiently large i∉A(Y k).
By definition of f, we have that
Since f is continuous, there exists an \(\tilde{\varepsilon}_{i}>0\) such that for all \(z \in B_{\tilde{\varepsilon}_{i}} (\bar{x})\)
If \(Y^{k} \subseteq B_{\tilde{\varepsilon}_{i}}(\bar{x})\), then we have \(|x^{k} - \bar{x}| < \tilde{\varepsilon}_{i}\) and \(|y^{j} - \bar{x}| < \tilde{\varepsilon}_{i}\) for all j=1,…,m. Therefore
and
Thus, if \(Y^{k} \subseteq B_{\tilde{\varepsilon}_{i}} (\bar{x})\), then \(i \not\in A(Y^{k})\). Letting \(\tilde{\varepsilon} = \min_{i} \tilde{\varepsilon}_{i}\) completes the proof. □
Now we examine the convergence of the RAGS algorithm.
Theorem 3.5
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Set ε tol =0 and suppose that \(\{x^{k}\}^{\infty}_{k=0}\) is an infinite sequence generated by the RAGS algorithm. Suppose there exists a \(\bar{K} >0\) such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies the error bound \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le \bar{K} \Delta^{k}\) for all i=1,…,N. Suppose t k is bounded away from 0. Then either
-
1.
f(x k)↓−∞, or
-
2.
|d k|→0, Δk↓0 and every cluster point \(\bar{x}\) of the sequence \(\{ x^{k}\}^{\infty}_{k=0}\) satisfies \(0 \in \partial f(\bar{x})\).
Proof
If f(x k)↓−∞, then we are done.
Otherwise, f(x k) is bounded below; since it is also non-increasing, f(x k) converges. We consider two cases.
Case 1: An infinite number of line search successes occur.
Let \(\bar{x}\) be a cluster point of \(\{ x^{k}\}^{\infty}_{k=0}\). Notice that x k only changes for line search successes, so there exists a subsequence \(\{x^{k_{j}}\}^{\infty}_{j=0}\) of line search successes such that \(x^{k_{j}} \to\bar{x}\). Following the arguments of Theorem 2.8, we have \(|d^{k_{j}}| \to0\) and \(\Delta^{k_{j}} \downarrow 0\). Notice that if \(\Delta^{k_{j}} \downarrow0\), then eventually \(Y^{k_{j}} \subseteq B_{\tilde{\varepsilon}} (\bar{x})\), where \(x^{k_{j}} \to\bar{x}\) and \(\tilde{\varepsilon}\) is defined as in Lemma 3.4. Thus, by Lemma 3.4, we have that \(A(Y^{k_{j}}) \subseteq A(\bar{x})\). This means that \(G_{Y^{k_{j}}} (x^{k_{j}})\) is formed from a subset of the approximated active gradients related to \(\partial f(\bar{x})\). Thus, by Lemma 2.1(1), as \(-d^{k_{j}} \in G_{Y^{k_{j}}}(x^{k_{j}})\), we can construct a \(v^{k_{j}} \in\partial f(\bar{x})\) from the same set of approximated active gradients, such that
Using the Triangle Inequality, we have that
We already showed that \(|d^{k_{j}}| \to0\) and \(\Delta^{k_{j}} \downarrow 0\). Furthermore, since ∇f i is continuous and \(x^{k_{j}} \to\bar {x}\), we have \(| \nabla f_{i}(x^{k_{j}}) - \nabla f_{i}(\bar{x}) | \to0\). So,
Using the same arguments as in Theorem 2.8, the result follows.
Case 2: A finite number of line search successes occur.
This means there exists a \(\bar{k}\) such that \(x^{k} = x^{\bar{k}}=\bar{x}\) for all \(k \ge\bar{k}\). However, by Corollary 3.3, if \(0 \notin\partial f(\bar{x})\), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have \(0 \in\partial f(\bar{x})\). This implies Δk↓0 and |d k|→0, as in the proof of Theorem 2.8. □
Remark 3.6
Using d k to check our stopping conditions allows the result of Theorem 2.9 to still hold.
3.3 Robust stopping with Goldstein approximate subdifferential
We want to provide some insight as to how Theorem 2.9 can work for stopping conditions based on \(d^{k}_{Y}\), that is, replacing the stopping conditions Δk≤μ k|d k| and |d k|<ε tol in Step 2 with the robust stopping conditions
The following proposition does not provide a full theoretical justification that the stopping conditions are sufficient when the algorithm terminates, but it does help explain their logic. Theoretically, since we do not know \(\bar{x}\) in advance, we cannot tell when \(A(Y) \subseteq A(\bar{x})\). However, as seen above, we do know that if \(x^{k} \rightarrow\bar{x}\), then eventually \(A(Y^{k}) \subseteq A(\bar{x})\).
Proposition 3.7
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose there exists a \(\bar{K}> 0\) such that \(|\nabla_{A} f_{i}(x) - \nabla f_{i}(x)| \le\bar{K} \Delta^{k}\) for all i=1,…,N and for all \(x \in B_{\Delta^{k}} (x^{k})\). Suppose the RAGS algorithm terminates at some iteration \(\bar{k}\) in Step 2 using the robust stopping conditions given in (16). Furthermore, suppose there exists \(\bar{x} \in B_{\Delta^{\bar{k}}} (x^{\bar{k}})\) such that \(A(Y^{\bar{k}}) \subseteq A(\bar{x})\). Then
Proof
If \(A(Y^{\bar{k}}) \subseteq A(\bar{x})\), then the proofs of Lemma 2.1(1) and Theorem 2.9 still hold. □
In the following results, we approach the theory for the robust stopping conditions using the Goldstein approximate subdifferential. If the RAGS algorithm terminates in Step 2, then we show that the distance between 0 and the Goldstein approximate subdifferential is controlled by ε tol . Again, this does not prove that the robust stopping conditions are sufficient for the exact subdifferential.
First, the Goldstein approximate subdifferential, as defined in [17], is given by the set
We now show that the Goldstein approximate subdifferential contains all of the gradients of the active functions in the robust active set.
Lemma 3.8
Let f=max{f i :i=1,…,N}. Let Y=[y 0,y 1,y 2,…,y m] be a randomly sampled set from a ball centered at \(y^{0}=\bar{x}\) with radius Δ. If \(f_{i} \in\mathcal{C}^{1}\) for each i, then
Proof
If \(f_{i} \in\mathcal{C}^{1}\) for each i∈A(Y), then by (2), for each y j∈Y we have
Using this in our definition of the Goldstein approximate subdifferential in (17) and knowing \(B_{\Delta}(\bar{x}) \supseteq Y\), we have
which simplifies to
□
Now we have a result similar to Lemma 2.1(1) for \(d^{k}_{Y}\) with respect to the Goldstein approximate subdifferential.
Remark 3.9
For the following two results, we assume each of the \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant L. Note that this implies the Lipschitz constant L is independent of i. If each \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant L i , then a common constant is obtained by setting L=max{L i :i=1,…,N}.
Lemma 3.10
Let f=max{f i :i=1,…,N} where \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant L. Let Y=[y 0,y 1,y 2,…,y m] be a randomly sampled set from a ball centered at \(y^{0}=\bar{x}\) with radius Δ. Suppose there exists a \(\bar{K} >0\) such that \(| \nabla_{A} f_{i} (\bar{x})- \nabla f_{i}(\bar{x})| \le\bar{K} \Delta \). Then for all \(w \in G_{Y}(\bar{x})\), there exists a \(g \in\partial_{\Delta}^{G} f(\bar{x})\) such that
Proof
By definition, for all \(w \in G_{Y}(\bar{x})\) there exists a set of α i such that
By our assumption that each \(f_{i} \in\mathcal{C}^{1+}\), Lemma 3.8 holds. It is clear that for each i∈A(Y), i∈A(y j) for some y j∈Y. Let j i be the index corresponding to this active index; i.e., \(i \in A(y^{j_{i}})\). Thus, for each i∈A(Y), there is a corresponding active gradient
Using the same α i as above, we can construct
Then
□
Thus, using Lemma 3.10, we can show that if the algorithm stops due to the robust stopping conditions, then the distance from 0 to the Goldstein approximate subdifferential is controlled by ε tol .
Proposition 3.11
Let f=max{f i :i=1,…,N} where each \(f_{i} \in~\mathcal {C}^{1+}\) with Lipschitz constant L. Suppose there exists a \(\bar{K} > 0\) such that for each iteration k, the approximate gradient satisfies \(|\nabla_{A} f_{i}(x^{k}) - \nabla f_{i}(x^{k})| \le\bar{K} \Delta ^{k}\) for all i=1,…,N. Suppose the RAGS algorithm terminates at some iteration \(\bar{k}\) in Step 2 using the robust stopping conditions given in (16). Then
Proof
Let \(\bar{w} = \operatorname{Proj}(0|G_{Y^{\bar{k}}}(x^{\bar {k}}))\). We use \(\bar {g} \in\partial^{G}_{\Delta^{\bar{k}}} f(x^{\bar{k}})\) as constructed in Lemma 3.10 to see that
The statement now follows from the test \(\Delta^{\bar{k}} \le \mu^{\bar{k}} |d^{\bar{k}}_{Y}|\) in Step 2 and the fact that \(\mu^{\bar{k}} \le\mu^{0}\), as \(\{\mu^{k}\}^{\infty}_{k=0}\) is a non-increasing sequence. □
4 Approximate gradients
As seen in the previous two sections, in order for convergence to be guaranteed in the AGS or RAGS algorithm, the approximate gradient used must satisfy an error bound for each of the active f i . Specifically, there must exist a \(\bar{K}>0\) such that
where Δk=max j |y j−x k|. In this section, we present three specific approximate gradients that satisfy the above requirement: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function.
4.1 Simplex gradient
The simplex gradient is a commonly used approximate gradient. In recent years, several derivative-free algorithms have been proposed that use the simplex gradient ([5, 11, 12, 22] and [19] among others). It is geometrically defined as the gradient of the linear interpolation of f over a set of n+1 points in ℝn. Mathematically, we define it as follows.
Let Y=[y 0,y 1,…,y n] be a set of affinely independent points in ℝn. We say that Y forms the simplex \(S=\operatorname{conv}\{ Y\}\). Thus, S is a simplex if it can be written as the convex hull of an affinely independent set of n+1 points in ℝn.
The simplex gradient of a function f over the set Y is given by
where
and
The condition number of the simplex formed by Y is given by \(\| \hat{L}^{-1}\|\), where
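As a small illustration (a sketch, not the paper's implementation), the simplex gradient is the solution of one n×n linear system built from function differences, assuming the standard construction: an edge matrix L with rows y i −y 0 and the difference vector δ(f;Y):

```python
import numpy as np

def simplex_gradient(f, Y):
    """Simplex gradient of f over Y = [y0, y1, ..., yn] (rows of an array).

    Solves L g = delta, where row i of L is y^i - y^0 and
    delta_i = f(y^i) - f(y^0).  The result g is the gradient of the
    linear interpolant of f on the simplex conv{Y}.
    """
    Y = np.asarray(Y, dtype=float)
    y0, rest = Y[0], Y[1:]
    L = rest - y0                           # (n, n) matrix of edge directions
    delta = np.array([f(y) for y in rest]) - f(y0)
    return np.linalg.solve(L, delta)
```

On a linear function the simplex gradient is exact, since the linear interpolant coincides with the function itself.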
4.1.1 Convergence
The following result (by Kelley [23]) shows that there exists an appropriate error bound between the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz constant used in the following theorem corresponds to ∇f i .
Theorem 4.1
Consider \(f_{i} \in\mathcal{C}^{1+}\) with Lipschitz constant K i for ∇f i . Let Y=[y 0,y 1,…,y n] form a simplex. Let
Then the simplex gradient satisfies the error bound
where \(\bar{K} = \frac{1}{2} K_{i} \sqrt{n} \| \hat {L}^{-1} \|\).
Proof
See [23, Lemma 6.2.1]. □
With the above error bound result, we conclude that convergence holds when using the simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.
Corollary 4.2
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant K i for ∇f i . If the approximate gradient used in the AGS or RAGS algorithm is the simplex gradient and \(\|\hat{L}^{-1}\|\) is bounded above for each simplex gradient computed, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.
4.1.2 Algorithm—simplex gradient
In order to calculate a simplex gradient in Step 1, we generate a set Y=[x k,y 1,…,y n] of points in ℝn and then check to see if Y forms a well-poised simplex by calculating its condition number, \(\|\hat{L}^{-1}\|\). A bounded condition number (\(\| \hat {L}^{-1}\| < n\)) ensures a ‘good’ error bound between the approximate gradient and the exact gradient.
If the set Y does not form a well-poised simplex (\(\|\hat{L}^{-1}\| \ge n\)), then we resample. If Y forms a well-poised simplex, then we calculate the simplex gradient of f i over Y for each i∈A(x k) and then set the approximate subdifferential equal to the convex hull of the active simplex gradients. We note that the probability of generating a random matrix with a condition number greater than n is asymptotically constant [38]. Thus, randomly generating simplices is a quick and practical option. Furthermore, notice that calculating the condition number does not require function evaluations; thus, resampling does not affect the number of function evaluations required by the algorithm.
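The sample-then-check loop described above can be sketched as follows. This is a hedged illustration: we assume \(\hat{L}\) is the edge matrix scaled by the sampling radius Δ, and the uniform-in-ball sampling scheme is our own choice. Note that, as in the text, the poisedness check costs no function evaluations.

```python
import numpy as np

def sample_well_poised(x, Delta, cond_bound=None, rng=None, max_tries=500):
    """Sample Y = [x, y1, ..., yn] from B_Delta(x) until the scaled edge
    matrix Lhat = (Y[1:] - x) / Delta is well poised (||Lhat^{-1}|| < n).

    The scaling of Lhat is an assumption here; the bound ||Lhat^{-1}|| < n
    follows the text.  No function evaluations are used in this loop.
    """
    n = len(x)
    if cond_bound is None:
        cond_bound = n
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        # uniform points in the ball: random directions, radii ~ Delta*U^(1/n)
        D = rng.normal(size=(n, n))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        r = Delta * rng.random(n) ** (1.0 / n)
        offsets = D * r[:, None]
        Lhat = offsets / Delta
        if np.linalg.norm(np.linalg.inv(Lhat), 2) < cond_bound:
            return np.vstack([x, x + offsets])
    raise RuntimeError("failed to find a well-poised sample set")
```

Since the probability of a badly conditioned random sample is asymptotically constant, only a handful of resamples are typically needed.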
4.2 Centered simplex gradient
The centered simplex gradient is the average of two simplex gradients. Although it requires more function evaluations, it has the advantage that its error bound is in terms of Δ2 rather than Δ.
Let Y=[y 0,y 1,…,y n] form a simplex. We define the sets
and
where x=y 0 and \(\tilde{y}^{i} = y^{i} - y^{0}\) for i=1,…,n. The centered simplex gradient is the average of the two simplex gradients over the sets Y + and Y −, i.e.,
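A minimal sketch of the construction above (names ours), assuming the simplex gradient over a set [x, x±d 1 ,…, x±d n ] solves the usual edge-matrix linear system. The error-cancellation behind the Δ2 bound is visible in the fact that the centered simplex gradient is exact on quadratics:

```python
import numpy as np

def centered_simplex_gradient(f, Y):
    """Centered simplex gradient: the average of the simplex gradients
    over Y+ = [x, x + d1, ...] and Y- = [x, x - d1, ...], with x = y0
    and di = yi - y0.
    """
    Y = np.asarray(Y, dtype=float)
    x, D = Y[0], Y[1:] - Y[0]               # rows of D are the edges di
    fx = f(x)
    dplus = np.array([f(x + d) for d in D]) - fx
    dminus = np.array([f(x - d) for d in D]) - fx
    gplus = np.linalg.solve(D, dplus)        # simplex gradient over Y+
    gminus = np.linalg.solve(-D, dminus)     # simplex gradient over Y-
    return 0.5 * (gplus + gminus)
```

For a quadratic, the second-order terms of the forward and backward differences cancel exactly in the average, mirroring the Δ2 error bound of Theorem 4.3.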
4.2.1 Convergence
To show that the AGS and RAGS algorithms are both well-defined when using the centered simplex gradient as an approximate gradient, we provide an error bound between the centered simplex gradient and the exact gradient (again by Kelley [23]).
Theorem 4.3
Consider \(f_{i} \in\mathcal{C}^{2+}\) with Lipschitz constant K i for ∇2 f i . Let Y=[y 0,y 1,…,y n] form a simplex. Let
Then the centered simplex gradient satisfies the error bound
where \(\bar{K} = K_{i} \sqrt{n} \|\hat{L}^{-1}\|\).
Proof
See [23, Lemma 6.2.5]. □
Notice that Theorem 4.3 requires \(f_{i} \in\mathcal {C}^{2+}\). If \(f_{i} \in\mathcal{C}^{1+}\), then the error bound is in terms of Δ, not Δ2. With the above error bound result, we conclude that convergence holds when using the centered simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.
Corollary 4.4
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{2+}\) with Lipschitz constant K i for ∇2 f i . If the approximate gradient used in the AGS or RAGS algorithm is the centered simplex gradient, \(\|\hat{L}^{-1}\|\) is bounded above for each centered simplex gradient computed and Δ0≤1, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.
Proof
Since Δ 0≤1 and the sequence {Δ k } is non-increasing, (Δ k )2≤Δ k , and so Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold. □
4.2.2 Algorithm
In Step 1, to adapt the AGS or RAGS algorithm to use the centered simplex gradient, we sample our set Y in the same manner as for the simplex gradient (resampling until a well-poised set is achieved). We then form the sets Y + and Y − and proceed as expected.
4.3 Gupal estimate
The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [24] uses the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient. We see in Theorem 4.8 that an appropriate error bound exists for this approximate gradient. Surprisingly, unlike the error bounds for the simplex and centered simplex gradients, the error bound in Theorem 4.8 does not include a condition number term. Mathematically, we define the Gupal estimate of the gradient of the Steklov averaged function as follows.
For α>0, the Steklov averaged function f α , as defined in [16, Def. 3.1], is given by
where ψ α :ℝn→ℝ+ is the Steklov mollifier defined by
We can equivalently define the Steklov averaged function by
The partial derivatives of f α are given by ([16, Prop. 3.11] and [18])
for i=1,…,n, where \(\mathbb{B}_{\infty}= [-1/2, 1/2]^{n}\) is the unit cube centered at 0 and
Given α>0 and \(z=(\zeta^{1}, \ldots, \zeta^{n}) \in\prod^{n}_{i=1} \mathbb{B}_{\infty}\), the Gupal estimate of ∇f α (x) over z is given by
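A hedged sketch of the estimate (our reading of the coordinate-wise definition; treat the exact form as an assumption): component j is a central difference of f in coordinate j of width α, evaluated at a point whose other coordinates carry the random perturbation αζ j .

```python
import numpy as np

def gupal_estimate(f, x, alpha, z):
    """Gupal estimate of the gradient of the Steklov averaged function.

    z = (zeta^1, ..., zeta^n), each zeta^j in [-1/2, 1/2]^n.  Component j
    is a central difference of f in coordinate j of width alpha, taken at
    x + alpha*zeta^j with the j-th perturbation replaced by +/- 1/2.
    (This coordinate-wise form is an assumption about the definition.)
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.empty(n)
    for j in range(n):
        yplus = x + alpha * np.asarray(z[j], dtype=float)
        yminus = yplus.copy()
        yplus[j] = x[j] + alpha / 2.0     # the points y^{j+} and y^{j-}
        yminus[j] = x[j] - alpha / 2.0    # differ only in coordinate j
        g[j] = (f(yplus) - f(yminus)) / alpha
    return g
```

Since y j+ and y j− differ only in coordinate j, the estimate is exact on linear functions regardless of z, and no condition-number term appears in its error bound.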
Remark 4.5
Although we define γ(x,α,z) as the Gupal estimate of ∇f α (x), in Sect. 4.3.1, we will show that γ(x,α,z) provides a good approximation to the exact gradient, ∇f(x).
Remark 4.6
For the following results, we note that the α used in the above definitions is equivalent to our sampling radius Δ. Thus, we will be replacing α with Δ in the convergence results in Sect. 4.3.1.
4.3.1 Convergence
As before, in order to show that both the AGS and RAGS algorithms are well-defined when using the Gupal estimate as an approximate gradient, we must establish that it provides a good approximation of the exact gradient. To do this, we first need the following classic result from [13].
Lemma 4.7
[13, Lemma 4.1.12]
Let \(f \in\mathcal{C}^{1+}\) with Lipschitz constant K for ∇f. Let y 0∈ℝn. Then for any y∈ℝn
Using Lemma 4.7, we establish an error bound between the Gupal estimate and the exact gradient of f.
Theorem 4.8
Consider \(f_{i} \in\mathcal{C}^{1+}\) with Lipschitz constant K i for ∇f i . Let ε>0. Then for Δ>0, \(z = ( \zeta ^{1}, \ldots, \zeta^{n} ) \in Z = \prod^{n}_{i=1} \mathbb{B}_{\infty}\) and any point x∈ℝn, the Gupal estimate of ∇f i,Δ(x) satisfies the error bound
Proof
For Δ>0, let
and
Applying Lemma 4.7, we have that
From (21) (with α=Δ), we can see that
Hence, (23) becomes
From our definitions of y j− and y j+, we can see that
The inner product in (24) simplifies to
Thus, we have
implying
Also notice that
Using the standard basis vector e j, we have
Thus, since \(f_{i} \in\mathcal{C}^{1+}\), we have
Noting that
we have
Finally,
□
We conclude that convergence holds when using the Gupal estimate of the gradient of the Steklov averaged function of f as an approximate gradient in both the AGS and RAGS algorithms.
Corollary 4.9
Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant K i for ∇f i . If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate of the gradient of the Steklov averaged function, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.
4.3.2 Algorithm
To use the Gupal estimate of the gradient of the Steklov averaged function in both the AGS and RAGS algorithms, in Step 1, we sample \(\{z^{kl}\}^{m}_{l=1}\) independently and uniformly from the unit cube in ℝn×n, where m is the number of active functions.
5 Numerical results
5.1 Versions of the AGS and RAGS algorithms
We implemented the AGS and RAGS algorithms using the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as approximate gradients.
Additionally, we used the robust descent direction to create robust stopping conditions. That is, the algorithm terminates when
where \(d^{k}_{Y}\) is the projection of 0 onto the approximate subdifferential generated using the robust active set (see Proposition 3.11 for results linking the robust stopping conditions with the Goldstein approximate subdifferential). The implementation was done in MATLAB (v. 7.11.0.584, R2010b). Software is available by contacting the corresponding author.
Let d k denote the regular descent direction and let \(d^{k}_{Y}\) denote the robust descent direction. There are three scenarios that could occur when using the robust stopping conditions:
-
1.
\(|d^{k}| = |d^{k}_{Y}|\);
-
2.
\(|d^{k}| \ge|d^{k}_{Y}|\), but checking the stopping conditions leads to the same result (line search, radius decrease or termination); or
-
3.
\(|d^{k}| \ge|d^{k}_{Y}|\), but checking the stopping conditions leads to a different result.
In Scenarios 1 and 2, the robust stopping conditions have no influence on the algorithm. In Scenario 3, we have two cases:
-
1.
\(\Delta^{k} \le\mu^{k} |d^{k}_{Y}| \le\mu^{k} |d^{k}|\), but |d k|≥ε tol and \(|d^{k}_{Y}| < \varepsilon_{tol}\) or
-
2.
Δk≤μ k|d k| holds, but \(\Delta^{k} > \mu^{k} |d^{k}_{Y}|\).
Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS algorithms to do one of two things: terminate early, providing a solution with a smaller quality measure but requiring fewer function evaluations to find, or reduce the sampling radius instead of carrying out a line search, reducing the number of function evaluations carried out during that iteration and producing a more accurate approximate subdifferential at the next iteration.
Our goal in this testing is to determine if there are any notable numerical differences in the quality of the three approximate gradients (simplex, centered simplex, and Gupal estimate), the two search directions (robust and non-robust), and the two stopping conditions (robust and non-robust). This leads to the following 12 versions:
-
AGS Simplex (1. non-robust/2. robust stopping)
-
RAGS Simplex (3. non-robust/4. robust stopping)
-
AGS Centered Simplex (5. non-robust/6. robust stopping)
-
RAGS Centered Simplex (7. non-robust/8. robust stopping)
-
AGS Gupal (9. non-robust/10. robust stopping)
-
RAGS Gupal (11. non-robust/12. robust stopping)
5.2 Test sets and software
Testing was performed on a 2.0 GHz Intel Core i7 Macbook Pro. We used the test set from Lukšan-Vlček [27]. The first 25 problems presented are of the desired form
Of these 25 problems, we omit problem 2.17 because its sub-functions are complex-valued. Thus, our test set comprises 24 finite minimax problems with dimensions from 2 to 20. Several problems contain sub-functions of the form f i =|g i |, where g i is a smooth function; we rewrote these as f i =max{g i ,−g i }. The resulting test problems have from 2 to 130 sub-functions. A summary of the test problems appears in Table 2 in the Appendix.
5.3 Initialization and stopping conditions
We first describe our choices for the initialization parameters used in the AGS and RAGS algorithms.
The initial starting points are given for each problem in [27]. We set the initial accuracy measure to 0.5 with a reduction factor of 0.5. We set the initial sampling radius to 0.1 with a reduction factor of 0.5. The Armijo-like parameter η was chosen to be 0.1 to ensure that a line search success resulted in a significant function value decrease. We set the minimum step length to 10−10.
Next, we discuss the stopping tolerances used to ensure finite termination of the AGS and RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The first is our theoretical stopping condition, while the remaining three are to ensure numerical stability of the algorithm.
-
1.
Stopping conditions met—As stated in the theoretical algorithm, the algorithm terminates for this reason when Δk≤μ k|d k| and |d k|<ε tol , where d k is either the regular or the robust descent direction.
-
2.
Hessian matrix has NaN/Inf entries—For the solution of the quadratic program in Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations. When these limitations result in NaN or Inf entries in the Hessian, the algorithm terminates.
-
3.
Δk, μ k, and |d k| are small—This stopping criterion bypasses the test Δk≤μ k|d k| (in Step 2) and stops if Δk<Δtol, μ k<μ tol and |d k|<ε tol . Examining Theorem 2.9 along with Theorems 4.1, 4.3 and 4.8, it is clear that this is also a valid stopping criterion. We used a bound of 10−6 in our implementation for both Δk and μ k.
-
4.
Max number of function evaluations reached—As a final failsafe, we added an upper bound of 106 on the number of function evaluations allowed. (This stopping condition only occurs once in our results.)
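The four termination tests above can be collected into one routine; a hedged sketch (parameter names, default tolerances, and the order of the checks are our own choices):

```python
import numpy as np

def termination_reason(Delta, mu, d_norm, eps_tol, n_fevals,
                       Delta_tol=1e-6, mu_tol=1e-6, max_fevals=10**6,
                       hessian=None):
    """Return the termination reason for the current iteration, or None.

    Mirrors the four stopping tests described in the text: NaN/Inf data
    in the QP, the theoretical test, the small-parameter safeguard, and
    the evaluation budget.
    """
    if hessian is not None and not np.all(np.isfinite(hessian)):
        return "hessian NaN/Inf"
    if Delta <= mu * d_norm and d_norm < eps_tol:
        return "stopping conditions met"
    if Delta < Delta_tol and mu < mu_tol and d_norm < eps_tol:
        return "Delta, mu and |d| all small"
    if n_fevals >= max_fevals:
        return "max function evaluations"
    return None
```

Here d_norm may be either |d k| or the robust \(|d^{k}_{Y}|\), matching the regular and robust stopping conditions.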
5.4 Results
Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each version. For each of the 25 trials, we record the number of function evaluations, the number of iterations, the solution, the quality of the solution and the reason for termination. The quality was measured by the improvement in the number of digits of accuracy, which is calculated using the formula
where F min is the function value at the final (best) iterate, F ∗ is the true minimum value (optimal value) of the problem (as given in [27]) and F 0 is the function value at the initial iterate. Results on function evaluations and solution quality appear in Tables 3, 4 and 5 of the Appendix.
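The displayed formula did not survive extraction; the standard benchmarking measure matching the description (assumed here to be the one used) is −log10((F min −F ∗ )/(F 0 −F ∗ )):

```python
import math

def digits_of_accuracy(F_min, F_star, F_0):
    """Improvement in digits of accuracy: -log10((F_min - F*)/(F_0 - F*)).

    F_min: best function value found, F_star: true optimal value,
    F_0: function value at the initial iterate.  (Standard measure;
    assumed to match the paper's displayed formula.)
    """
    return -math.log10((F_min - F_star) / (F_0 - F_star))
```

For example, reducing the optimality gap to one thousandth of its initial size counts as 3 digits of improvement.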
To visually compare algorithmic versions, we use performance profiles. A performance profile is the (cumulative) distribution function for a performance metric [14]. For the AGS and RAGS algorithms, the performance metric is the ratio of the number of function evaluations taken by a given version to successfully solve each test problem to the least number of function evaluations taken by any of the versions to solve that problem. Performance profiles eliminate the need to discard failures from numerical results and provide a visual representation of the performance differences between several solvers. For full details on the construction of performance profiles, see [14].
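The profile values can be computed directly from a table of per-problem evaluation counts; a minimal sketch (array layout, function name, and the np.inf failure convention are our own choices):

```python
import numpy as np

def performance_profile(T, taus):
    """Cumulative performance profile from a cost matrix T.

    T[p, s] is the number of function evaluations solver s needs on
    problem p (np.inf for a failure).  Returns rho with
    rho[i, s] = fraction of problems where T[p, s] <= taus[i] * best,
    best being the smallest cost on problem p over all solvers.
    """
    T = np.asarray(T, dtype=float)
    best = T.min(axis=1, keepdims=True)    # best solver per problem
    ratios = T / best                       # performance ratios r_{p,s}
    return np.array([(ratios <= t).mean(axis=0) for t in taus])
```

Plotting rho[:, s] against taus for each solver s gives the profiles shown in Figs. 1(a) and 1(b).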
In Figs. 1(a) and 1(b) we include a performance profile showing all 12 versions of the AGS and RAGS algorithms tested, declaring a success for 1 digit and 3 digits of improvement, respectively.
In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient does not produce the best results. It only produces 1 or more digits of accuracy for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is no significant difference between the performance of the AGS and RAGS algorithms using the simplex and centered simplex gradients as approximate gradients.
Looking at the results in Tables 3, 4 and 5, and our performance profiles, we can make the following two observations:
-
(i)
the versions of the RAGS algorithm generally outperform (converge faster than) the versions of the AGS algorithm, and
-
(ii)
the RAGS algorithm using the robust stopping conditions terminates faster and with lower (but still significant) accuracy.
Robust active set: From our results, it is clear that expanding the active set to include ‘almost active’ functions in the RAGS algorithm greatly improves performance for the simplex and centered simplex algorithm. This robust active set brings more local information into the approximate subdifferential and thereby allows for descent directions that are more parallel to any nondifferentiable ridges formed by the function.
Robust stopping conditions: We notice from the performance profiles that in terms of function evaluations, the robust stopping conditions improve the overall performance of the RAGS algorithm, although they decrease the average accuracy on some problems. These results agree with our previously discussed hypothesis. Furthermore, upon studying the reasons for termination, it appears that the non-robust stopping conditions cause the AGS and RAGS algorithms to terminate mainly because Δk and μ k become too small, whereas with the robust stopping conditions the RAGS algorithm often terminated because the stopping conditions were satisfied. As our theory in Sect. 3.3 is not complete, we cannot make theoretical statements about how the robust stopping conditions would perform in general (like those in Theorem 2.9). However, from our results, we conjecture that the alteration is beneficial for decreasing function evaluations.
In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the RAGS algorithm either matches or outperforms the AGS algorithm in average accuracy obtained over 25 trials using the simplex and centered simplex gradients. Knowing this, we conclude that the improvement of the accuracy is due to the choice of a robust search direction.
5.5 Comparison to NOMAD
NOMAD is a well-established general DFO solver, not specialized for the class of minimax problems [1, 25]. Using the same set of 24 test problems as in our previous tests, we compare the RAGS algorithm to the NOMAD algorithm. Based on the previous tests, we use the simplex gradient with robust stopping conditions. All parameters for the NOMAD algorithm were left at their default settings. The performance profile generated by these two algorithms for the improvement of 3 digits of accuracy appears in Fig. 2.
As seen in Fig. 2, NOMAD appears slightly faster than RAGS, but slightly less robust. Overall, the two algorithms are fairly comparable. Further research into RAGS may lead to algorithmic improvement. For example, a more informed selection of the sample set Y in Step 1, an improved line search, or better selection of default parameters for RAGS, could all lead to improved convergence. We leave such research for future study.
6 Conclusion
We have presented a new derivative-free algorithm for finite minimax problems that exploits the smooth substructure of the problem. Convergence results are given for any arbitrary approximate gradient that satisfies an error bound dependent on the sampling radius. Three examples of such approximate gradients are given. Additionally, a robust version of the algorithm is presented and shown to have the same convergence results as the regular version.
Through numerical testing, we find that the robust version of the algorithm outperforms the regular version with respect to both the number of function evaluations required and the accuracy of the solutions obtained. Additionally, we tested robust stopping conditions and found that they generally required fewer function evaluations before termination. Overall, the robust stopping conditions paired with the robust version of the algorithm performed best (as seen in the performance profiles).
Considerable future work is available in this research direction. Most obvious is an exploration of the theory behind the performance of the robust stopping conditions. Another direction lies in the theoretical requirement bounding the step length away from 0 (see Theorems 2.8 and 3.5). In gradient based methods, one common way to avoid this requirement is with the use of Wolfe-like conditions. We are unaware of any derivative-free variant on the Wolfe conditions.
References
Abramson, A.M., Audet, C., Couture, G., Dennis, J.E. Jr., Le Digabel, S., Tribes, C.: The NOMAD project. Software available at http://www.gerad.ca/nomad
Bagirov, A.M., Karasözen, B., Sezer, M.: Discrete gradient method: derivative-free method for nonsmooth optimization. J. Optim. Theory Appl. 137(2), 317–334 (2008)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, New York (2011)
Booker, A.J., Dennis, J.E. Jr., Frank, P.D., Serafini, D.B., Torczon, V.: Optimization using surrogate objectives on a helicopter test example. In: Computational Methods for Optimal Design and Control, Arlington, VA, 1997. Progr. Systems Control Theory, vol. 24, pp. 49–58. Birkhäuser, Boston (1998)
Bortz, D.M., Kelley, C.T.: The simplex gradient and noisy optimization problems. In: Computational Methods for Optimal Design and Control. Progr. Systems Control Theory, vol. 24, pp. 77–90. Birkhäuser, Boston (1998)
Burke, J.V., Lewis, A.S., Overton, M.L.: Approximating subdifferentials by random sampling of gradients. Math. Oper. Res. 27(3), 567–584 (2002)
Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2005)
Cai, X., Teo, K., Yang, X., Zhou, X.: Portfolio optimization under a minimax rule. Manag. Sci. 46(7), 957–972 (2000)
Clarke, F.H.: Optimization and Nonsmooth Analysis, 2nd edn. Classics Appl. Math., vol. 5. SIAM, Philadelphia (1990)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. MPS/SIAM Series on Optimization, vol. 8. SIAM, Philadelphia (2009)
Custódio, A.L., Dennis, J.E. Jr., Vicente, L.N.: Using simplex gradients of nonsmooth functions in direct search methods. IMA J. Numer. Anal. 28(4), 770–784 (2008)
Custódio, A.L., Vicente, L.N.: Using sampling and simplex derivatives in pattern search methods. SIAM J. Optim. 18(2), 537–555 (2007)
Dennis, J.E. Jr., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Classics in Applied Mathematics. SIAM, Philadelphia (1996)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2, Ser. A), 201–213 (2002)
Duvigneau, R., Visonneau, M.: Hydrodynamic design using a derivative-free method. Struct. Multidiscip. Optim. 28, 195–205 (2004)
Ermoliev, Y.M., Norkin, V.I., Wets, R.J.-B.: The minimization of semicontinuous functions: mollifier subgradients. SIAM J. Control Optim. 33, 149–167 (1995)
Goldstein, A.A.: Optimization of Lipschitz continuous functions. Math. Program. 13(1), 14–22 (1977)
Gupal, A.M.: A method for the minimization of almost differentiable functions. Kibernetika 1, 114–116 (1977) (in Russian); English translation in: Cybernetics, 13(2), 220–222 (1977)
Hare, W., Macklem, M.: Derivative-free optimization methods for finite minimax problems. Optim. Methods Softw. 28(2), 300–312 (2013)
Hare, W.L.: Using derivative free optimization for constrained parameter selection in a home and community care forecasting model. In: International Perspectives on Operations Research and Health Care, Proceedings of the 34th Meeting of the EURO Working Group on Operational Research Applied to Health Sciences, pp. 61–73 (2010)
Imae, J., Ohtsuki, N., Kikuchi, Y., Kobayashi, T.: A minimax control design for nonlinear systems based on genetic programming: Jung’s collective unconscious approach. Int. J. Syst. Sci. 35, 775–785 (2004)
Kelley, C.T.: Detection and remediation of stagnation in the Nelder–Mead algorithm using a sufficient decrease condition. SIAM J. Optim. 10(1), 43–55 (1999)
Kelley, C.T.: Iterative Methods for Optimization. Frontiers in Applied Mathematics, vol. 18. SIAM, Philadelphia (1999)
Kiwiel, K.C.: A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim. 20(4), 1983–1994 (2010)
Le Digabel, S.: Algorithm 909: NOMAD: nonlinear optimization with the MADS algorithm. ACM Trans. Math. Softw. 37(4), 1–15 (2011)
Liuzzi, G., Lucidi, S., Sciandrone, M.: A derivative-free algorithm for linearly constrained finite minimax problems. SIAM J. Optim. 16(4), 1054–1075 (2006)
Lukšan, L., Vlček, J.: Test Problems for Nonsmooth Unconstrained and Linearly Constrained Optimization. Technical report (February 2000)
Madsen, K.: Minimax solution of non-linear equations without calculating derivatives. Math. Program. Stud. 3, 110–126 (1975)
Marsden, A.L., Feinstein, J.A., Taylor, C.A.: A computational framework for derivative-free optimization of cardiovascular geometries. Comput. Methods Appl. Mech. Eng. 197(21–24), 1890–1905 (2008)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research. Springer, New York (1999)
Di Pillo, G., Grippo, L., Lucidi, S.: A smooth method for the finite minimax problem. Math. Program., Ser. A 60(2), 187–214 (1993)
Polak, E.: On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Rev. 29(1), 21–89 (1987)
Polak, E., Royset, J.O., Womersley, R.S.: Algorithms with adaptive smoothing for finite minimax problems. J. Optim. Theory Appl. 119(3), 459–484 (2003)
Polyak, R.A.: Smooth optimization methods for minimax problems. SIAM J. Control Optim. 26(6), 1274–1286 (1988)
Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 317. Springer, Berlin (1998)
Stafford, R.: Random Points in an n-Dimensional Hypersphere. MATLAB File Exchange (2005). http://www.mathworks.com/matlabcentral/fileexchange/9443-random-points-in-an-n-dimensional-hypersphere
Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. Math. Program. Stud. 3, 145–173 (1975)
Wschebor, M.: Smoothed analysis of κ(A). J. Complex. 20(1), 97–107 (2004)
Xu, S.: Smoothing method for minimax problems. Comput. Optim. Appl. 20(3), 267–279 (2001)
Acknowledgements
The authors would like to express their gratitude to Dr. C. Sagastizábal for inspirational conversations regarding the Goldstein subdifferential. This research was partially funded by the NSERC DG program and by UBC IRF.
Hare, W., Nutini, J. A derivative-free approximate gradient sampling algorithm for finite minimax problems. Comput Optim Appl 56, 1–38 (2013). https://doi.org/10.1007/s10589-013-9547-6