1 Introduction

In this paper we consider the finite minimax problem:

$$\min_x f(x)\quad \text{where } f(x)=\max\bigl\{ f_i(x) : i = 1, \ldots, N \bigr\}, $$

where each individual \(f_i\) is continuously differentiable. We further restrict ourselves to the field of derivative-free optimization (DFO), where we are only permitted to compute function values, i.e., we cannot compute gradient values \(\nabla f_i\) directly. We present a derivative-free algorithm that exploits the smooth substructure of the finite max problem, thereby creating a robust algorithm with an elegant convergence theory.

Finite minimax problems occur in numerous applications, such as portfolio optimization [8], control system design [21], engineering design [32], and determining the cosine measure of a positive spanning set [10, Def. 2.7]. In a finite max function, although each individual \(f_i\) may be smooth, taking the maximum forms a nonsmooth function with ‘nondifferentiable ridges’. For this reason, most algorithms designed to solve finite minimax problems employ some form of smoothing technique; see [31, 33, 34] and [39], among many others. In general, these smoothing techniques require gradient calculations.

However, in many situations gradient information is not available or can be difficult to compute accurately (see [4, 15, 20, 29] and [10, Chap. 1] for some examples of such situations). Such situations are considered by research in the area of derivative-free optimization. For a thorough introduction to several basic DFO frameworks and convergence results for each, see [10].

Research on optimizing finite max functions without calculating derivatives can be seen as early as 1975 [28], and more recently the area has seen a resurgence of interest; see [26] and [19].

In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential penalty function in a directional direct-search framework to form a derivative-free optimization method for finite minimax problems [26]. This method is shown to converge globally to a standard stationary point of the original finite minimax problem.

Also specific to the finite minimax problem, a derivative-free method is presented in [19] that exploits the smooth substructure of the problem. It combines the frameworks of a directional direct search method [10, Chap. 7] and the gradient sampling algorithm (GS algorithm) presented in [6] and [7]. Loosely speaking, the GS algorithm uses a collection of local gradients to build a ‘robust subdifferential’ of the objective function and uses this to determine a ‘robust descent direction’. In [19], these ideas are used to develop several methods to find an approximate descent direction that moves close to parallel to an ‘active manifold’. During each iteration, points are sampled from around the current iterate and the simplex gradient is calculated for each of the active functions of the objective function. The calculated simplex gradients are then used to form an approximate subdifferential, which is then used to determine a likely descent direction.

Ideas from the GS algorithm have appeared in two other recent DFO methodologies [2] and [24].

In 2008, Bagirov, Karasözen and Sezer presented a discrete gradient derivative-free method for unconstrained nonsmooth optimization problems [2]. Described as a derivative-free version of the bundle method presented in [37], the method uses discrete gradients to approximate subgradients of the function and build an approximate subdifferential. The analysis of this method provides proof of convergence to a Clarke stationary point for an extensive class of nonsmooth problems. In this paper, we focus on the finite minimax problem. This allows us to require few additional assumptions on our function while maintaining a strong convergence analysis. It is worth noting that we use the same set of test problems as in [2]. Specifically, we use the test set of [27] and exclude one problem because its sub-functions are complex-valued. (The numerics in [2] exclude the same problem, and several others, without explanation.)

By using approximate gradient calculations instead of gradient calculations, Kiwiel makes the GS algorithm derivative-free in [24]. Specifically, Kiwiel employs the Gupal estimate of the gradient of the Steklov averaged function (see [18] or Sect. 4.3 herein) as an approximate gradient. It is shown that, with probability 1, this derivative-free algorithm satisfies the same convergence results as the GS algorithm—it either drives the f-values to −∞ or every cluster point is Clarke stationary [24, Theorem 3.8]. No numerical results are presented for Kiwiel’s derivative-free algorithm.

In this paper, we use the GS algorithm framework with approximate gradients to form a derivative-free approximate gradient sampling algorithm. As we are dealing with finite max functions, instead of calculating an approximate gradient at each of the sampled points, we calculate an approximate gradient for each of the active functions. Expanding the active set to include ‘almost’ active functions, we also present a robust version of our algorithm, which is more akin to the GS algorithm. In this robust version, when our iterate is close to a point of nondifferentiability, the size and shape of our approximate subdifferential will reflect the presence of ‘almost active’ functions. Hence, when we project 0 onto our approximate subdifferential, the descent direction will direct minimization parallel to a ‘nondifferentiable ridge’, rather than straight at this ridge. It can be seen in our numerical results that these robust changes greatly influence the performance of our algorithm.

Our algorithm differs from the above in a few key ways. Unlike in [26], we do not employ a smoothing technique. Unlike in [19], which uses the directional direct-search framework to imply convergence, we employ an approximate steepest descent framework. Using this framework, we are able to analyze convergence directly and develop stopping conditions for the algorithm. Unlike in [2] and [24], where convergence is proven for a specific approximate gradient, we prove convergence for any approximate gradient that satisfies a simple error bound dependent on the sampling radius. As examples, we present the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. (As a side-note, Sect. 4.3 also provides, to the best of the authors’ knowledge, novel error analysis of the Gupal estimate of the gradient of the Steklov averaged function.)

Focusing on the finite minimax problem provides us with an advantage over the methods of [2] and [24]. In particular, we only require order n function calls per iteration (where n is the dimension of the problem), while both [2] and [24] require order mn function calls per iteration (where m is the number of gradients they approximate to build their approximate subdifferential). (The original GS algorithm suggests that m≈2n provides a good value for m.)

The remainder of this paper is organized as follows. In Sect. 2, we present the approximate gradient sampling algorithm (AGS algorithm) and our convergence analysis. In Sect. 3, we present a robust version of the AGS algorithm (RAGS algorithm), which uses ‘almost active’ functions in the calculation of the approximate subdifferential. In Sect. 4, we show that the AGS and RAGS algorithms converge using three specific approximate gradients: simplex gradient, centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. Finally, in Sect. 5, we present our numerical results and analysis.

2 Approximate gradient sampling algorithm

Throughout this paper, we assume that our objective function is of the form

$$ \min_x f(x)\quad \text{where } f(x)=\max \bigl\{ f_i(x) : i = 1, \ldots, N \bigr\}, $$
(1)

where each \(f_{i} \in\mathcal{C}^{1}\), but we cannot compute \(\nabla f_i\). We use \(\mathcal{C}^{1}\) to denote the class of differentiable functions whose gradient mapping \(\nabla f\) is continuous. We denote by \(\mathcal{C}^{1+}\) the class of continuously differentiable functions whose gradient mapping \(\nabla f\) is locally Lipschitz and we denote by \(\mathcal{C}^{2+}\) the class of twice continuously differentiable functions whose Hessian mapping \(\nabla^2 f\) is locally Lipschitz. Additionally, throughout this paper, |⋅| denotes the Euclidean norm and ∥⋅∥ denotes the corresponding matrix norm.

For the finite max function in (1), we define the active set of f at a point \(\bar{x}\) to be the set of indices

$$A(\bar{x})=\bigl\{ i : f(\bar{x})=f_i (\bar{x}) \bigr\}. $$

The set of active gradients of f at \(\bar{x}\) is denoted by

$$\bigl\{ \nabla f_i(\bar{x})\bigr\}_{i \in A(\bar{x})}. $$

Let f be locally Lipschitz at a point \(\bar{x}\). As f is Lipschitz, there exists an open dense set D⊂ℝn such that f is continuously differentiable on D. The Clarke subdifferential [9] is constructed via

$$\partial f(x) = \bigcap_{\varepsilon> 0} G_\varepsilon(x) \quad \mbox{where } G_\varepsilon(x) = \operatorname{cl}\operatorname {conv} \bigl\{ \nabla f(y) : y \in B_\varepsilon(x) \cap D\bigr\}. $$

For a finite max function, assuming \(f_{i} \in\mathcal{C}^{1}\) for each \(i \in A(\bar{x})\), the Clarke subdifferential (as proven in [9, Prop. 2.3.12]) is equivalent to

$$ \partial f(\bar{x})= \operatorname{conv}\bigl\{ \nabla f_i(\bar{x})\bigr\}_{i \in A(\bar{x})}. $$
(2)

By (2), it is clear that for finite max functions the subdifferential is a compact set. This will be important in the convergence analysis in Sect. 2.2.
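As a simple illustration of (2) (our example, not taken from the original text), take N=2 with \(f_1(x)=x\) and \(f_2(x)=-x\) on ℝ, so that \(f(x)=|x|\). At \(\bar{x}=0\) both functions are active and

$$\partial f(0)= \operatorname{conv}\bigl\{ \nabla f_1(0), \nabla f_2(0)\bigr\} = \operatorname{conv}\{1,-1\} = [-1,1], $$

which is indeed compact.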

We are now ready to state the general form of the AGS algorithm, an approximate subgradient descent method.

2.1 Algorithm—AGS

We first provide a partial glossary of notation used in the definition of the AGS algorithm (see Table 1).

Table 1 Glossary of notation used in the AGS algorithm

Conceptual Algorithm: [Approximate Gradient Sampling Algorithm]

  Step 0.

    Initialize: Set k=0 and input

    • \(x^0\)—starting point

    • \(\mu^0>0\)—accuracy measure

    • \(\Delta^0>0\)—initial sampling radius

    • \(\theta\in(0,1)\)—sampling radius reduction factor

    • \(0<\eta<1\)—Armijo-like parameter

    • \(t_{\min}\)—minimum step length

    • \(\varepsilon_{tol}>0\)—stopping tolerance

  Step 1.

    Generate Approximate Subdifferential G k:

    Generate a set \(Y=[x^k, y^1, \ldots, y^m]\) around the current iterate \(x^k\) such that

    $$\max_{j=1, \ldots, m} \bigl|y^j - x^k\bigr| \le \Delta^k. $$

    Use Y to calculate the approximate gradient of \(f_i\), denoted \(\nabla_A f_i\), at \(x^k\) for each \(i \in A(x^k)\). Set

    $$G^k = \operatorname{conv}\bigl\{ \nabla_A f_i \bigl(x^k\bigr)\bigr\}_{i \in A(x^k)}. $$
  Step 2.

    Generate Search Direction:

    Let

    $$d^k = -\operatorname{Proj}\bigl(0|G^k\bigr). $$

    Check if

    $$ \Delta^k \le\mu^k \bigl|d^k\bigr|. $$
    (3)

    If (3) does not hold, then set \(x^{k+1}=x^k\),

    $$ \Delta^{k+1}= \begin{cases} \theta\mu^k |d^k| & \text{if $|d^{k}| \not= 0$}, \\ \theta\Delta^k & \text{if $|d^{k}| = 0$}, \end{cases} $$
    (4)

    \(k=k+1\) and return to Step 1. If (3) holds and \(|d^k| < \varepsilon_{tol}\), then STOP. Else, continue to the line search.

  Step 3.

    Line Search:

    Attempt to find \(t^k>0\) such that

    $$f\bigl(x^k +t^k d^k\bigr) < f \bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$

    Line Search Failure:

    Set \(\mu^{k+1} = \frac{\mu^{k}}{2}\), \(x^{k+1}=x^k\) and go to Step 4.

    Line Search Success:

    Let x k+1 be any point such that

    $$f\bigl(x^{k+1}\bigr) \le f\bigl(x^k +t^k d^k\bigr). $$
  Step 4.

    Update and Loop:

    Set \(\Delta^{k+1}=\max_{j=1, \ldots, m} |y^j - x^k|\), \(k=k+1\) and return to Step 1.

In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting point \(x^0\), and initialize the parameter values.

In Step 1, we create the approximate subdifferential. First, we select a set of points around \(x^k\) within a sampling radius of \(\Delta^k\). In implementation, the points are randomly and uniformly sampled from a ball of radius \(\Delta^k\) (using the MATLAB randsphere.m function [36]). Using this set Y, we then calculate an approximate gradient for each of the active functions at \(x^k\) and set the approximate subdifferential \(G^k\) equal to the convex hull of these active approximate gradients, \(\nabla_A f_i(x^k)\). Details on various approximate gradients appear in Sect. 4.
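Our implementation performs the sampling with the MATLAB routine randsphere.m; the following is a minimal NumPy sketch of an equivalent sampling step (the name sample_ball and the use of NumPy are our own illustrative choices, not part of the original implementation).

```python
import numpy as np

def sample_ball(x_k, delta_k, m, rng=None):
    """Return Y = [x^k; y^1; ...; y^m] with each y^j drawn uniformly
    from the closed ball of radius delta_k centred at x^k."""
    rng = np.random.default_rng() if rng is None else rng
    n = x_k.size
    # Uniform directions: normalise Gaussian samples onto the unit sphere.
    directions = rng.standard_normal((m, n))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii distributed as delta_k * U^(1/n) give uniform points in the ball.
    radii = delta_k * rng.random(m) ** (1.0 / n)
    return np.vstack([x_k, x_k + radii[:, None] * directions])
```

By construction, \(\max_{j}|y^j - x^k| \le \Delta^k\), as required in Step 1.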

In Step 2, we generate a search direction by computing the projection of 0 onto the approximate subdifferential: \(\operatorname{Proj}(0|G^{k}) \in\operatorname{arg\,min} _{g \in G^{k}} |g|^{2}\). The search direction \(d^k\) is set equal to the negative of the solution, i.e., \(d^{k} = -\operatorname{Proj}(0|G^{k})\).
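The projection is the solution of a small quadratic program over the unit simplex: minimize \(|\sum_i \alpha_i \nabla_A f_i(x^k)|^2\) subject to \(\alpha_i \ge 0\) and \(\sum_i \alpha_i = 1\). Below is a minimal sketch using SciPy's SLSQP solver; the function name and solver choice are our own (the original implementation is in MATLAB), and any convex QP solver could be substituted.

```python
import numpy as np
from scipy.optimize import minimize

def search_direction(gradients):
    """gradients: (p, n) array whose rows are the active approximate
    gradients; returns d^k = -Proj(0 | conv{rows of gradients})."""
    p = gradients.shape[0]

    # Quadratic objective |sum_i alpha_i g_i|^2 over the unit simplex.
    def objective(alpha):
        return np.linalg.norm(alpha @ gradients) ** 2

    constraints = [{"type": "eq", "fun": lambda a: np.sum(a) - 1.0}]
    bounds = [(0.0, 1.0)] * p
    alpha0 = np.full(p, 1.0 / p)          # start from the barycentre
    res = minimize(objective, alpha0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return -(res.x @ gradients)           # d^k = -Proj(0 | G^k)
```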

After finding a search direction, we check the inequality \(\Delta^k \le\mu^k |d^k|\). This inequality determines whether the current sampling radius is sufficiently small relative to the distance from 0 to the approximate subdifferential. If this inequality holds and \(|d^k| < \varepsilon_{tol}\), then we terminate the algorithm, as 0 is within \(\varepsilon_{tol}\) of the approximate subdifferential and the sampling radius is small enough to trust that the approximate subdifferential is accurate. If the above inequality does not hold, then the approximate subdifferential is not sufficiently accurate to warrant a line search, so we decrease the sampling radius, set \(x^{k+1}=x^k\), update the iterate counter and loop (Step 1). If the above inequality holds, but \(|d^k| \ge\varepsilon_{tol}\), then we proceed to a line search.

In Step 3, we carry out a line search. We attempt to find a step length \(t^k>0\) such that the Armijo-like condition holds

$$ f\bigl(x^k +t^k d^k\bigr) < f \bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$
(5)

This condition ensures that sufficient decrease is found in the function value. In implementation, we use a back-tracking line search (described in [30]) with an initial step length of \(t_{ini}=1\), terminating when the step length \(t^k\) is less than a threshold \(t_{\min}\). If we find a \(t^k\) such that (5) holds, then we declare a line search success. If not, then we declare a line search failure.

If a line search success occurs, then we let x k+1 be any point such that

$$ f\bigl(x^{k+1}\bigr) \le f\bigl(x^k + t^k d^k\bigr). $$
(6)

In implementation, we do this by searching through the function values used in the calculation of our approximate gradients (\(\{f(y^{i})\}_{y^{i} \in Y}\)). As this set of function values corresponds to points distributed around our current iterate, there is a good possibility of finding further decrease in the function value without carrying out additional function evaluations. We find the minimum function value in our set of evaluations and, if (6) holds for this minimum value, we set \(x^{k+1}\) equal to the corresponding input point. Otherwise, we set \(x^{k+1}=x^k+t^k d^k\).

If a line search failure occurs, then we reduce the accuracy measure \(\mu^k\) by a factor of \(\frac{1}{2}\) and set \(x^{k+1}=x^k\).
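The following sketch combines the back-tracking search for the Armijo-like condition (5) with the reuse of sampled function values in (6). The halving factor and the names line_search, sampled_points and sampled_values are illustrative assumptions on our part; the paper's implementation follows the back-tracking scheme of [30].

```python
import numpy as np

def line_search(f, x_k, d_k, eta, t_min, sampled_points, sampled_values,
                t_init=1.0, backtrack=0.5):
    """Return (x_next, success).  sampled_points/sampled_values hold the
    points of Y and their already-computed function values."""
    f_k = f(x_k)
    d_norm_sq = float(np.dot(d_k, d_k))
    t = t_init
    while t >= t_min:
        trial = x_k + t * d_k
        f_trial = f(trial)
        if f_trial < f_k - eta * t * d_norm_sq:        # Armijo-like test (5)
            j = int(np.argmin(sampled_values))
            if sampled_values[j] <= f_trial:           # reuse a cheaper sample, cf. (6)
                return sampled_points[j], True
            return trial, True
        t *= backtrack                                  # shrink the step length
    return x_k, False                                   # line search failure
```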

Finally, in Step 4, we update the iterate counter and the sampling radius, and then loop to Step 1 to resample.

2.2 Convergence

For the following results, we denote the approximate subdifferential of f at \(\bar{x}\) as

$$G(\bar{x})= \operatorname{conv}\bigl\{ \nabla_A f_i ( \bar{x})\bigr\}_{i \in A(\bar{x})}, $$

where \(\nabla_{A} f_{i} (\bar{x})\) is the approximate gradient of f i at \(\bar{x}\). Our first result establishes an error bound relation between the elements of the approximate subdifferential and the exact subdifferential.

Lemma 2.1

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose there exists an ε>0 such that \(| \nabla _{A} f_{i} (\bar{x}) - \nabla f_{i}(\bar{x})| \le\varepsilon\) for all i=1,…,N. Then

  1.

    for all \(w \in G(\bar{x})\), there exists a \(v \in\partial f(\bar{x})\) such that \(|w-v|\le\varepsilon\), and

  2.

    for all \(v \in\partial f(\bar{x})\), there exists a \(w \in G(\bar{x})\) such that \(|w-v|\le\varepsilon\).

Proof

1. By definition, for all \(w \in G(\bar{x})\) there exists a set of α i such that

$$w=\sum_{i \in A(\bar{x})} \alpha_i \nabla_A f_i (\bar{x}),\quad \text{where } \alpha_i \ge0, \sum_{i \in A(\bar{x})} \alpha_i = 1. $$

By our assumption that each \(f_{i} \in\mathcal{C}^{1}\), we have \(\partial f(\bar{x})= \operatorname{conv}\{\nabla f_{i}(\bar{x})\}_{i \in A(\bar{x})}\). Using the same α i as above, we see that

$$v = \sum_{i \in A(\bar{x})} \alpha_i \nabla f_i(\bar{x})\in \partial f(\bar{x}) $$

Then

$$|w-v| = \biggl| \sum_{i \in A(\bar{x})} \alpha_i \bigl( \nabla_A f_i (\bar{x}) - \nabla f_i(\bar{x}) \bigr) \biggr| \le\sum_{i \in A(\bar{x})} \alpha_i \bigl| \nabla_A f_i (\bar{x}) - \nabla f_i(\bar{x}) \bigr| \le\varepsilon. $$

Hence, for all \(w \in G(\bar{x})\), there exists a \(v \in\partial f(\bar{x})\) such that

$$ |w-v | \le\varepsilon. $$
(7)

2. Analogous arguments can be applied to \(v \in\partial f(\bar{x})\). □

Lemma 2.1 quantifies the quality of the approximate subdifferential as an approximation of the exact subdifferential, provided the approximate gradients of the component functions are good approximations of the true gradients. Our next goal (in Theorem 2.4) is to show that eventually a line search success will occur in the AGS algorithm. To achieve this we make use of the following lemma.

Lemma 2.2

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose there exists an ε>0 such that \(| \nabla_{A} f_{i} (\bar{x})- \nabla f_{i}(\bar{x})| \le\varepsilon\) for all i=1,…,N. Define \(d=-\operatorname{Proj} (0|G(\bar{x}))\) and suppose \(|d| \not= 0\). Let β∈[0,1). If ε<(1−β)|d|, then for all \(v \in\partial f(\bar {x})\) we have

$$\langle d, v \rangle< - \beta|d|^2. $$

Proof

Notice that, by the Projection Theorem [3, Theorem 3.14], \(d=-\operatorname{Proj}(0|G(\bar{x}))\) implies that

$$ \bigl\langle0-(-d), w-(-d) \bigr\rangle\le0\quad \text{for all } w \in G( \bar{x}). $$

Hence,

$$ \langle d, w+d \rangle\le0\quad \text{for all } w \in G(\bar{x}). $$
(8)

So we have for all \(v \in\partial f(\bar{x})\),

$$\begin{array}{rcl@{\quad}l} \langle d,v \rangle &=& \langle d, v-w+w-d+d \rangle & \mbox{for all } w \in G(\bar{x}) \\[3pt] &=& \langle d, v-w \rangle+ \langle d,w+d \rangle+ \langle d,-d\rangle & \mbox{for all } w \in G(\bar{x})\\[3pt] &\le&\langle d, v-w \rangle- |d|^2 & \mbox{for all } w \in G(\bar{x})\\[3pt] &\le&|d| |v-w|-|d|^2 & \mbox{for all } w \in G(\bar{x}). \end{array} $$

For any \(v \in\partial f(\bar{x})\), using w as constructed in Lemma 2.1, we see that

$$\langle d,v \rangle\le|d| |v-w| - |d|^2 \le|d| \varepsilon- |d|^2 < |d| (1-\beta)|d| - |d|^2 = -\beta|d|^2. $$

 □

Remark 2.3

In Lemma 2.2, for the case when β=0, the condition ε<(1−β)|d| simplifies to ε<|d|. Thus, if ε is bounded above by |d|, then Lemma 2.2 proves that for all \(v \in\partial f(\bar{x})\) we have 〈d,v〉<0, showing that d is a descent direction for f at \(\bar{x}\).

To guarantee convergence, we must show that, except in the case of 0∈∂f(x k), the algorithm will always be able to find a sampling radius that satisfies the requirements in Step 2. In Sect. 4 we show that (for three different approximate gradients) the value ε (in Lemma 2.2) is linked to Δ. As unsuccessful line searches will drive Δ to zero, this implies that eventually the requirements of Lemma 2.2 will be satisfied. We formalize this in the next two theorems.

Theorem 2.4

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose \(0 \not\in\partial f(x^{k})\) for each iteration k. Suppose there exists \(\bar{K} >0 \) such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies \(|\nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le\bar{K} \Delta ^{k}\) for all i=1,…,N. Let \(d^{k} = - \operatorname{Proj}(0|G(x^{k}))\). Then for any μ>0, there exists \(\bar{\Delta}=\bar{\Delta}(x^{k})>0\) such that,

$$\Delta\le\mu\bigl|d^k\bigr| + \bar{K} \mu\bigl(\Delta^k - \Delta \bigr)\quad \text{\textit{for all} } 0 < \Delta< \bar{\Delta}. $$

Moreover, if \(\Delta^{k} < \bar{\Delta}\), then the following inequality holds

$$\Delta^k \le\mu\bigl| d^k \bigr|. $$

Proof

Let \(\bar{v} = \operatorname{Proj}(0|\partial f(x^{k}))\) (by assumption, \(\bar{v} \neq0\)).

Given μ>0, let

$$ \bar{\Delta} = \frac{1}{\bar{K}+\frac{1}{\mu}} |\bar{v}|, $$
(9)

and consider \(0 < \Delta< \bar{\Delta}\). Now create G(x k) and \(d^{k} = - \operatorname{Proj}(0|G(x^{k}))\). As −d kG(x k), by Lemma 2.1(1), there exists a v k∂f(x k) such that

$$\bigl|-d^k-v^k \bigr| \le\bar{K} \Delta^k. $$

Then

$$\bar{K} \Delta^k \ge\bigl|-d^k-v^k\bigr| \ge\bigl|v^k\bigr| - \bigl|d^k\bigr| \ge|\bar{v}| - \bigl|d^k\bigr|. $$

Thus, for \(0 < \Delta< \bar{\Delta}\), we apply (9) to \(|\bar{v}|\) in the above inequality to get

$$\bar{K} \Delta^k \ge\biggl(\bar{K} + \frac{1}{\mu}\biggr) \Delta- \bigl|d^k\bigr|, $$

which rearranges to

$$\Delta\le\mu \bigl|d^k\bigr| + \bar{K} \mu\bigl(\Delta^k - \Delta \bigr). $$

Hence, \(\Delta\le\mu|d^{k}| + \bar{K} \mu(\Delta^{k} - \Delta)\) for all \(0 < \Delta< \bar{\Delta}\). Finally, if \(\Delta^{k} < \bar {\Delta}\), then

$$\Delta^k \le\mu\bigl| d^k\bigr|. $$

 □

Remark 2.5

In Theorem 2.4, it is important to note that eventually the condition \(\Delta^{k} < \bar{\Delta}\) will hold. Examine \(\bar {\Delta}\) as constructed above: \(\bar{K}\) is a constant and \(\bar {v}\) is associated with the current iterate. However, the current iterate is only updated when a line search success occurs, which will not occur unless the condition Δkμ k|d k| is satisfied. As a result, if \(\Delta^{k} \ge\bar{\Delta}\), the AGS algorithm will reduce Δk, with \(\bar{\Delta}\) remaining constant, until \(\Delta^{k} <\bar{\Delta}\).

Recall in Step 3 of the AGS algorithm, for a given η∈(0,1), we attempt to find a step length t k>0 such that

$$f\bigl(x^k +t^k d^k \bigr) < f \bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$

The following result shows that eventually the above inequality will hold in the AGS algorithm. Recall that the exact subdifferential for a finite max function, as defined in (2), is a compact set. Thus, we know that in the following theorem \(\tilde{v}\) is well-defined.

Theorem 2.6

Fix 0<η<1. Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal{C}^{1}\). Suppose there exists an ε>0 such that \(| \nabla_{A} f_{i} (\bar{x})- \nabla f_{i}(\bar{x})| \le \varepsilon\) for all i=1,…,N. Define \(d=-\operatorname{Proj}(0|G(\bar{x}))\) and suppose \(|d| \not= 0\). Let \(\tilde {v} \in\operatorname{arg\,max}\{ \langle d,v \rangle: v \in\partial f(\bar{x})\}\). Let \(\beta= \frac{2\eta}{1+\eta}\). If ε<(1−β)|d|, then there exists \(\bar{t} >0\) such that

$$f(\bar{x}+t d) - f(\bar{x}) < -\eta t |d|^2\quad \text{\textit{for all} } 0<t<\bar{t}. $$

Proof

Note that β∈(0,1). Recall, from Lemma 2.2, we have for all \(v \in\partial f(\bar{x})\)

$$ \langle d, v \rangle< -\beta|d|^2. $$
(10)

Using \(\beta= \frac{2\eta}{1+\eta}\), (10) becomes

$$ \langle d, v \rangle< - \frac{2\eta}{1+\eta} |d|^2\quad \mbox{for all } v\in\partial f(\bar{x}). $$
(11)

From (11) we can conclude that for all \(v \in \partial f(\bar{x})\)

$$\langle d,v \rangle< 0. $$

Notice that

$$\lim_{\tau\searrow0} \frac{ f(\bar{x}+\tau d)-f(\bar{x}) }{\tau }=\max\bigl\{ \langle d,v \rangle: v \in\partial f(\bar{x})\bigr\}=\langle d, \tilde{v} \rangle<0. $$

Therefore, there exists \(\bar{t} > 0\) such that

$$\frac{ f(\bar{x}+t d)-f(\bar{x}) }{t} < \frac{\eta+1}{2} \langle d, \tilde{v} \rangle\quad \text{for all } 0<t<\bar{t}. $$

For such a t, we have

$$f(\bar{x}+t d)-f(\bar{x}) < t \frac{\eta+1}{2} \langle d, \tilde{v} \rangle< t \frac{\eta+1}{2} \biggl( -\frac{2\eta}{1+\eta} |d|^2 \biggr) = -\eta t |d|^2. $$

Hence,

$$f(\bar{x} +t d)-f(\bar{x}) < -\eta t |d|^2\quad \text{for all } 0<t< \bar{t}. $$

 □

Combining the previous results, we can show that the AGS algorithm is guaranteed to find function value decrease (provided 0∉∂f(x k)). We summarize with the following corollary.

Corollary 2.7

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose 0∉∂f(x k) for each iteration k. Suppose there exists a \(\bar{K} >0\) such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies \(| \nabla_{A} f_{i} (x^{k}) - \nabla f_{i} (x^{k})| \le\bar{K} \Delta^{k}\) for all i=1,…,N. Then after a finite number of iterations, the algorithm will find a new iterate with a lower function value.

Proof

Consider x k, where 0∉∂f(x k).

To find function value decrease with the AGS algorithm, we must declare a line search success in Step 3. The AGS algorithm will only carry out a line search if the condition below is satisfied

$$ \Delta^k \le\mu^k \bigl|d^k\bigr|, $$
(12)

where \(d^{k} = -\operatorname{Proj}(0|G(x^{k}))\), as usual. In Theorem 2.4, we showed that for any \(\mu^k>0\), there exists a \(\bar{\Delta}=\bar{\Delta}(x^{k})>0\) such that if \(\Delta^{k} < \bar{\Delta}(x^{k})\), then (12) is satisfied. If (12) is not satisfied, then \(\Delta^k\) is updated according to (4) and \(x^{k+1}=x^k\), which further implies \(\bar{\Delta} = \bar{\Delta}(x^{k+1})=\bar{\Delta}(x^{k})\) is unchanged. In this case, whether \(|d^{k}| \not= 0\) or \(|d^{k}| = 0\), we have \(\Delta^{k+1} \le\theta\Delta^k\). Hence an infinite sequence of iterations in which (12) is unsatisfied is impossible (as eventually we would have \(\Delta^{k} < \bar{\Delta}\)). So eventually (12) will be satisfied and the AGS algorithm will carry out a line search.

Now, in order to have a line search success, we must be able to find a step length t k such that the Armijo-like condition holds,

$$f\bigl(x^k + t^k d^k\bigr) < f \bigl(x^k\bigr) - \eta t^k \bigl|d^k\bigr|^2. $$

In Theorem 2.6, we showed that there exists \(\bar{t} >0\) such that

$$f\bigl(x^k + t^k d^k\bigr) - f \bigl(x^k\bigr) < - \eta t^k \bigl|d^k\bigr|^2 \quad \text{for all } 0<t^k<\bar{t}, $$

provided that for β∈(0,1),

$$ \varepsilon< (1- \beta) \bigl|d^k\bigr|. $$
(13)

Set \(\varepsilon= \bar{K} \Delta^{k}\). If (13) does not hold, then a line search failure will occur, resulting in \(\mu^{k+1}=0.5\mu^{k}\). Thus, eventually we will have \(\mu^{k} < \frac {(1-\beta)}{\bar{K}}\) and

$$\Delta^k \le\mu^k \bigl|d^k\bigr| < \frac{(1-\beta)}{\bar{K}} \bigl|d^k\bigr|, $$

which means (13) will hold. Thus, after a finite number of iterations, the AGS algorithm will declare a line search success and find a new iterate with a lower function value. □

We are now ready to prove convergence. In particular, we study the limiting case of the algorithm generating an infinite sequence (i.e., the situation with ε tol =0). In the following, assuming that the step length t k is bounded away from 0 means that there exists a \(\bar{t} > 0\) such that \(t^{k}>\bar{t}\).

Theorem 2.8

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Set ε tol =0 and suppose that \(\{x^{k}\}^{\infty}_{k=0}\) is an infinite sequence generated by the AGS algorithm. Suppose there exists a \(\bar{K} >0 \) such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies the error bound \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le \bar{K} \Delta^{k}\) for all i=1,…,N. Suppose t k is bounded away from 0. Then either

  1.

    \(f(x^k)\downarrow-\infty\), or

  2.

    \(|d^k|\to0\), \(\Delta^k\downarrow0\) and every cluster point \(\bar{x}\) of the sequence \(\{ x^{k}\}^{\infty}_{k=0}\) satisfies \(0 \in \partial f(\bar{x})\).

Proof

If f(x k)↓−∞, then we are done.

Otherwise, \(f(x^k)\) is bounded below; since \(f(x^k)\) is also non-increasing, \(f(x^k)\) converges. We consider two cases.

Case 1: An infinite number of line search successes occur.

Let \(\bar{x}\) be a cluster point of \(\{x^{k}\}^{\infty}_{k=0}\). Notice that x k only changes for line search successes, so there exists a subsequence \(\{x^{k_{j}}\}^{\infty}_{j=0}\) of line search successes such that \(x^{k_{j}} \to\bar{x}\). Then for each corresponding step length \(t^{k_{j}}\) and direction \(d^{k_{j}}\), the following condition holds

$$f\bigl(x^{k_j+1}\bigr) \le f\bigl(x^{k_j} + t^{k_j} d^{k_j} \bigr) < f\bigl(x^{k_j}\bigr) - \eta t^{k_j} \bigl|d^{k_j}\bigr|^2. $$

Note that

$$0 \le\eta t^{k_j} \bigl|d^{k_j}\bigr|^2 < f \bigl(x^{k_j}\bigr) - f\bigl(x^{k_j+1}\bigr). $$

Since f(x k) converges we know that \(f(x^{k_{j}}) - f(x^{k_{j}+1}) \rightarrow0\). Since \(t^{k_{j}}\) is bounded away from 0, we see that

$$\lim_{j \rightarrow\infty} \bigl|d^{k_j}\bigr| = 0. $$

Recall from the AGS algorithm, we check the condition

$$\Delta^{k_j} \le\mu^{k_j} \bigl|d^{k_j}\bigr|. $$

As \(\Delta^{k_{j}} >0\), \(\mu^{k_{j}} \le\mu^{0}\), and \(|d^{k_{j}}| \to0\), we conclude that \(\Delta^{k_{j}} \downarrow0\).

Finally, from Lemma 2.1(1), as \(-d^{k_{j}} \in G(x^{k_{j}})\), there exists a \(v^{k_{j}} \in\partial f(x^{k_{j}})\) such that

$$\bigl|-d^{k_j} - v^{k_j}\bigr| \le\bar{K} \Delta^{k_j}, $$

which implies that

$$0 \le\bigl|v^{k_j}\bigr| \le\bar{K} \Delta^{k_j} + \bigl|d^{k_j} \bigr| \to0. $$

So,

$$\lim_{j \to\infty} \bigl|v^{k_j}\bigr| = 0, $$

where \(|v^{k_{j}}| \ge\operatorname{dist}(0|\partial f(x^{k_{j}}))\geq 0\), which implies \(\operatorname{dist}(0|\partial f(x^{k_{j}})) \to0\). We have \(x^{k_{j}} \to \bar{x}\). As f is a finite max function, ∂f is outer semicontinuous (see [35, Definition 5.4 & Proposition 8.7]). Hence, every cluster point \(\bar{x}\) of a convergent subsequence of \(\{x^{k}\}^{\infty}_{k=0}\) satisfies \(0 \in \partial f(\bar{x})\).

Case 2: A finite number of line search successes occur.

This means there exists a \(\bar{k}\) such that \(x^{k} = x^{\bar{k}} =\bar{x}\) for all \(k \ge\bar{k}\). However, by Corollary 2.7, if \(0 \notin\partial f(\bar{x})\), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have \(0 \in\partial f(\bar{x})\).

To see \(\Delta^k\downarrow0\) and \(|d^k|\to0\), note that by Lemma 2.1(1) and \(0 \in\partial f(\bar{x})\), we have that for all \(k > \bar{k}\) there exists \(d \in G(x^{k})\) such that \(|d - 0| \leq\bar{K} \Delta^{k}\). In particular, \(|d^{k}| = |\operatorname{Proj}(0|G(x^{k}))| \leq \bar{K}\Delta^{k} \leq\bar{K}\Delta^{0}\) for all \(k > \bar{k}\). Now note that one of two situations must occur: either (3) is unsatisfied an infinite number of times, or after a finite number of iterations (3) is always satisfied and every line search ends in failure (Step 3). In the first case, we directly have \(\Delta^k\downarrow0\) (by (4)). In the second case, we have \(\mu^k\downarrow0\) (by Step 3), so \(\Delta^{k} \leq\mu^{k}|d^{k}| \leq\mu^{k}\bar{K}\Delta^{0}\) (by (3)) implies \(\Delta^k\downarrow0\). Finally, \(|d^{k}| \leq\bar{K}\Delta^{k}\) completes the proof. □

Our last result shows that if the algorithm terminates in Step 2, then the distance from 0 to the exact subdifferential is controlled by ε tol .

Theorem 2.9

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose there exists a \(\bar{K} >0\) such that for each iteration k, the approximate gradient satisfies \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le\bar{K} \Delta^{k}\) for all i=1,…,N. Suppose the AGS algorithm terminates at some iteration \(\bar{k}\) in Step 2 for ε tol >0. Then

$$\operatorname{dist}\bigl(0|\partial f\bigl(x^{\bar{k}}\bigr)\bigr) < \bigl(1 + \bar{K} \mu ^0\bigr) \varepsilon_{tol}. $$

Proof

Let \(\bar{w} = \operatorname{Proj}(0|G(x^{k}))\). We use \(\bar{v} \in \partial f(x^{k})\) as constructed in Lemma 2.1(1) to see that

$$\operatorname{dist}\bigl(0|\partial f\bigl(x^{k}\bigr)\bigr) \le|\bar{v}| \le|\bar{v} - \bar{w}| + |\bar{w}| \le\bar{K} \Delta^{k} + \bigl|d^{k}\bigr|. $$

The final statement now follows by the test \(\Delta^{k} \le\mu^{k}|d^{k}| \le\mu^{0}\varepsilon_{tol}\) in Step 2. □

3 Robust approximate gradient sampling algorithm

The AGS algorithm depends on the active set of functions at each iterate, A(x k). Of course, it is possible at various times in the algorithm for there to be functions that are inactive at the current iterate, but active within a small radius of the current iterate. Typically such behaviour means that the current iterate is close to a ‘nondifferentiable ridge’ formed by the function. In [6] and [7], it is suggested that allowing an algorithm to take into account these ‘almost active’ functions will provide a better idea of what is happening at and around the current iterate, thus, making the algorithm more robust.

In this section we present the robust approximate gradient sampling algorithm (RAGS algorithm). Specifically, we adapt the AGS algorithm by expanding our active set to include all functions that are active at any of the points in the set Y. Recall from the AGS algorithm that the set Y is sampled from within a ball of radius \(\Delta^k\). Thus, the points in Y are not far from the current iterate. We define the robust active set next.

Definition 3.1

Let f=max{f i :i=1,…,N} where \(f_{i} \in\mathcal{C}^{1}\). Let y 0=x k be the current iterate and Y=[y 0,y 1,y 2,…,y m] be a set of randomly sampled points from a ball centered at y 0 with radius Δk. The robust active set of f on Y is

$$A(Y)= \bigcup_{y^j \in Y} A\bigl(y^j\bigr). $$
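In implementation, the robust active set can be read off from the sub-function values already computed at the points of Y. The sketch below uses a small numerical tolerance to decide activity; the tolerance and the function name robust_active_set are our own implementation choices and are not part of Definition 3.1.

```python
import numpy as np

def robust_active_set(sub_values, tol=1e-12):
    """sub_values: (m+1, N) array with entry [j, i] = f_i(y^j) for y^j in Y.
    Returns A(Y), the union over y^j in Y of the (numerically) active indices."""
    f_vals = sub_values.max(axis=1, keepdims=True)      # f(y^j) = max_i f_i(y^j)
    active = np.abs(sub_values - f_vals) <= tol         # i active at y^j
    return set(np.flatnonzero(active.any(axis=0)))
```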

3.1 Algorithm—RAGS

Using the idea of the robust active set, we alter the AGS algorithm to accommodate the robust active set by replacing Steps 1 and 2 with the following.

  Step 1.

    Generate Approximate Subdifferential \(G^{k}_{Y}\) (Robust):

    Generate a set \(Y=[x^k, y^1, \ldots, y^m]\) around the current iterate \(x^k\) such that

    $$\max_{j=1, \ldots, m} \bigl|y^j - x^k\bigr| \le \Delta^k. $$

    Use Y to calculate the approximate gradient of \(f_i\), denoted \(\nabla_A f_i\), at \(x^k\) for each \(i \in A(Y)\). Then set \(G^{k} = \operatorname{conv}\{ \nabla_{A} f_{i} (x^{k})\}_{i \in A(x^{k})}\) and \(G^{k}_{Y} = \operatorname{conv}\{ \nabla_{A} f_{i} (x^{k})\}_{i \in A(Y)}\).

  Step 2.

    Generate Search Direction:

    Let

    $$d^k = -\operatorname{Proj}\bigl(0|G^k\bigr). $$

    Let

    $$d^k_Y = -\operatorname{Proj}\bigl(0|G^k_Y \bigr). $$

    Check if

    $$ \Delta^k \le\mu^k \bigl|d^k\bigr|. $$
    (14)

    If (14) does not hold, then set \(x^{k+1}=x^k\),

    $$ \Delta^{k+1}= \begin{cases} \theta\mu^k |d^k| & \text{if $|d^{k}| \not= 0$}, \\ \theta\Delta^k & \text{if $|d^{k}| = 0$}, \end{cases} $$

    \(k=k+1\) and return to Step 1. If (14) holds and \(|d^k| < \varepsilon_{tol}\), then STOP. Else, continue to the line search, using \(d^{k}_{Y}\) as a search direction.

Notice that in Step 2 we still use the stopping conditions from Sect. 2. Although this modification requires the calculation of two projections, neither projection is particularly difficult to compute. In Sect. 3.3, we use the Goldstein approximate subdifferential to adapt Theorem 2.9 to work for stopping conditions based on \(d^{k}_{Y}\), but we still do not have theoretical results for the exact subdifferential. It is important to note that no additional function evaluations are required for this modification.

In the numerics section, we test each version of our algorithm using the robust descent direction to check the stopping conditions. The results convincingly show that the robust stopping conditions not only guarantee convergence, but also significantly decrease the number of function evaluations required for the algorithm to converge.

3.2 Convergence

To show that the RAGS algorithm is well-defined, we will require the fact that, when \(\Delta^k\) is small enough, the robust active set is in fact equal to the original active set.

Lemma 3.2

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Let \(Y = [\bar{x}, y^{1}, \ldots,\allowbreak y^{m} ]\) be a randomly sampled set from a ball centered at \(\bar{x}\) with radius Δ. Then there exists an \(\tilde{\varepsilon}>0\) such that if \(Y \subseteq B_{\tilde{\varepsilon}}(\bar{x})\), then \(A(\bar{x})=A(Y)\).

Proof

Clearly, if \(i \in A(\bar{x})\), then iA(Y) as \(\bar{x} \in Y\).

Consider \(i \not\in A(\bar{x})\). Then by the definition of f, we have that

$$f_i (\bar{x}) < f(\bar{x}). $$

By definition, f is continuous, thus, there exists an \(\tilde {\varepsilon}_{i} >0\) such that for all \(z \in B_{\tilde{\varepsilon }_{i}} (\bar{x})\),

$$f_i (z) < f(z). $$

If \(\Delta< \tilde{\varepsilon}_{i}\), then we have \(|y^{j} - \bar{x}| < \tilde{\varepsilon}_{i}\) for all j=1,…,m. Therefore,

$$ f_i \bigl(y^j\bigr) < f \bigl(y^j\bigr) \quad\text{for all } j = 1, \ldots, m, $$
(15)

so \(i \not\in A(Y)\). Setting \(\tilde{\varepsilon} = \min_{i} \tilde{\varepsilon}_{i}\) completes the proof. □

Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-defined when using the robust active set.

Corollary 3.3

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose \(0 \not\in\partial f(x^{k})\) for each iteration k. Suppose there exists a \(\bar{K} >0\) such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k}) | \le\bar{K} \Delta ^{k}\) for all i=1,…,N. Then after a finite number of iterations, the RAGS algorithm will find function value decrease.

Proof

Consider x k, where 0∉∂f(x k).

For eventual contradiction, suppose we do not find function value decrease. In the RAGS algorithm, this corresponds to an infinite number of line search failures. If we have an infinite number of line search failures, then \(\Delta^k\to0\), as \(|d^k|\) is bounded, and \(x^{\bar{k}} = x^{k}\) for all \(\bar{k} \ge k\). In Lemma 3.2, \(\tilde{\varepsilon}\) depends only on \(x^k\). Hence, we can conclude that eventually \(\Delta^{k} \le\tilde{\varepsilon}\) and therefore \(Y \subseteq B_{\tilde{\varepsilon}} (x^{k})\). Thus, eventually \(A(x^k)=A(Y^k)\). Once the two active sets are equal, the results of Sect. 2 will hold. □

To examine convergence of the RAGS algorithm we use the result that eventually the robust active set at the current iterate will be a subset of the regular active set at any cluster point of the algorithm.

Lemma 3.4

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Let Y k=[x k,y 1,…,y m] be a randomly sampled set from a ball centered at x k with radius Δk. Let \(x^{k} \to \bar{x}\). Then there exists an \(\tilde{\varepsilon} > 0\) such that if \(Y^{k} \subseteq B_{\tilde{\varepsilon}}(\bar{x})\), then \(A(Y^{k}) \subseteq A(\bar{x})\).

Proof

Let \(i \notin A(\bar{x})\). We must show that, for k sufficiently large, \(i \notin A(Y^{k})\).

By definition of f, we have that

$$f_i(\bar{x}) < f(\bar{x}). $$

Since f is continuous, there exists an \(\tilde{\varepsilon}_{i}>0\) such that for all \(z \in B_{\tilde{\varepsilon}_{i}} (\bar{x})\)

$$f_i(z) < f(z). $$

If \(Y^{k} \subseteq B_{\tilde{\varepsilon}_{i}}(\bar{x})\), then we have \(|x^{k} - \bar{x}| < \tilde{\varepsilon}_{i}\) and \(|y^{j} - \bar{x}| < \tilde{\varepsilon}_{i}\) for all j=1,…,m. Therefore

$$f_i\bigl(x^k\bigr) < f\bigl(x^k\bigr) $$

and

$$f_i \bigl(y^j\bigr) < f\bigl(y^j\bigr)\quad \text{for all } j = 1, \ldots, m. $$

Thus, if \(Y^{k} \subseteq B_{\tilde{\varepsilon}_{i}} (\bar{x})\), then \(i \not\in A(Y^{k})\). Letting \(\tilde{\varepsilon} = \min_{i} \tilde{\varepsilon}_{i}\) completes the proof. □

Now we examine the convergence of the RAGS algorithm.

Theorem 3.5

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Set ε tol =0 and suppose that \(\{x^{k}\}^{\infty}_{k=0}\) is an infinite sequence generated by the RAGS algorithm. Suppose there exists a \(\bar{K} >0\) such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies the error bound \(| \nabla_{A} f_{i} (x^{k})- \nabla f_{i}(x^{k})| \le \bar{K} \Delta^{k}\) for all i=1,…,N. Suppose t k is bounded away from 0. Then either

  1.

    \(f(x^k)\downarrow-\infty\), or

  2.

    \(|d^k|\to0\), \(\Delta^k\downarrow0\) and every cluster point \(\bar{x}\) of the sequence \(\{ x^{k}\}^{\infty}_{k=0}\) satisfies \(0 \in \partial f(\bar{x})\).

Proof

If f(x k)↓−∞, then we are done.

Otherwise, \(f(x^k)\) is bounded below; since \(f(x^k)\) is also non-increasing, \(f(x^k)\) converges. We consider two cases.

Case 1: An infinite number of line search successes occur.

Let \(\bar{x}\) be a cluster point of \(\{ x^{k}\}^{\infty}_{k=0}\). Notice that \(x^k\) only changes for line search successes, so there exists a subsequence \(\{x^{k_{j}}\}^{\infty}_{j=0}\) of line search successes such that \(x^{k_{j}} \to\bar{x}\). Following the arguments of Theorem 2.8, we have \(|d^{k_{j}}| \to0\) and \(\Delta^{k_{j}} \downarrow 0\). Notice that if \(\Delta^{k_{j}} \downarrow0\), then eventually \(Y^{k_{j}} \subseteq B_{\tilde{\varepsilon}} (\bar{x})\), where \(x^{k_{j}} \to\bar{x}\) and \(\tilde{\varepsilon}\) is defined as in Lemma 3.4. Thus, by Lemma 3.4, we have that \(A(Y^{k_{j}}) \subseteq A(\bar{x})\). This means that \(G_{Y^{k_{j}}} (x^{k_{j}})\) is formed from a subset of the approximated active gradients, related to \(\partial f(\bar{x})\). Thus, by Lemma 2.1(1), as \(-d^{k_{j}} \in G_{Y^{k_{j}}}(x^{k_{j}})\), we can construct a \(v^{k_{j}} \in\partial f(\bar{x})\) from the same set of approximated active gradients, related to \(G_{Y^{k_{j}}}(x^{k_{j}})\), such that

$$\bigl|-d^{k_j} - v^{k_j}\bigr| \le\bar{K} \Delta^{k_j} + \sum_{i \in A(Y^{k_j})} \alpha_i \bigl| \nabla f_i \bigl(x^{k_j}\bigr) - \nabla f_i(\bar{x}) \bigr|. $$

Using the Triangle Inequality, we have that

$$\bigl|v^{k_j}\bigr| - \bigl|d^{k_j}\bigr| \le\bar{K} \Delta^{k_j} + \sum _{i \in A(Y^{k_j})} \alpha_i \bigl| \nabla f_i \bigl(x^{k_j}\bigr) - \nabla f_i(\bar {x}) \bigr|. $$

We already showed that \(|d^{k_{j}}| \to0\) and \(\Delta^{k_{j}} \downarrow 0\). Furthermore, since ∇f i is continuous and \(x^{k_{j}} \to\bar {x}\), we have \(| \nabla f_{i}(x^{k_{j}}) - \nabla f_{i}(\bar{x}) | \to0\). So,

$$\lim_{j \to\infty} \bigl|v^{k_j}\bigr| = 0. $$

Using the same arguments as in Theorem 2.8, the result follows.

Case 2: A finite number of line search successes occur.

This means there exists a \(\bar{k}\) such that \(x^{k} = x^{\bar{k}}=\bar{x}\) for all \(k \ge\bar{k}\). However, by Corollary 3.3, if \(0 \notin\partial f(\bar{x})\), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have \(0 \in\partial f(\bar{x})\). This implies Δk↓0 and |d k|→0, as in the proof of Theorem 2.8. □

Remark 3.6

Using \(d^k\) to check our stopping conditions allows the result of Theorem 2.9 to still hold.

3.3 Robust stopping with Goldstein approximate subdifferential

We want to provide some insight as to how Theorem 2.9 can work for stopping conditions based on \(d^{k}_{Y}\), that is, replacing the stopping conditions \(\Delta^k \le\mu^k |d^k|\) and \(|d^k| < \varepsilon_{tol}\) in Step 2 with the robust stopping conditions

$$ \Delta^k \le\mu^k \bigl|d^k_Y\bigr|\quad \text{and}\quad \bigl|d^k_Y\bigr| < \varepsilon_{tol}. $$
(16)

In the situation when the algorithm terminates, the following proposition does not theoretically justify why the stopping conditions are sufficient, but does help explain their logic. Theoretically, since we do not know what \(\bar{x}\) is, we cannot tell when \(A(Y) \subseteq A(\bar{x})\). However, as seen above, we do know that if \(x^{k} \rightarrow\bar{x}\), then eventually \(A(Y^{k}) \subseteq A(\bar{x})\).

Proposition 3.7

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1}\). Suppose there exists a \(\bar{K}> 0\) such that \(|\nabla_{A} f_{i}(x) - \nabla f_{i}(x)| \le\bar{K} \Delta^{k}\) for all i=1,…,N and for all \(x \in B_{\Delta^{k}} (x^{k})\). Suppose the RAGS algorithm terminates at some iteration \(\bar{k}\) in Step 2 using the robust stopping conditions given in (16). Furthermore, suppose there exists \(\bar{x} \in B_{\Delta^{\bar{k}}} (x^{\bar{k}})\) such that \(A(Y^{\bar{k}}) \subseteq A(\bar{x})\). Then

$$\operatorname{dist}\bigl(0|\partial f(\bar{x})\bigr) < \bigl(1+\bar{K} \mu^0\bigr) \varepsilon_{tol}. $$

Proof

If \(A(Y^{\bar{k}}) \subseteq A(\bar{x})\), then the proofs of Lemma 2.1(1) and Theorem 2.9 still hold. □

Additionally, in the following results, we approach the theory for robust stopping conditions using the Goldstein approximate subdifferential. If the RAGS algorithm terminates in Step 2, then it is shown that the distance between 0 and the Goldstein approximate subdifferential is controlled by ε tol . Again, this does not prove the robust stopping conditions are sufficient for the exact subdifferential.

First, the Goldstein approximate subdifferential, as defined in [17], is given by the set

$$ \partial_\Delta^G f(\bar{x}) = \operatorname{conv}\bigl\{ \partial f(z) : z \in B_\Delta(\bar{x}) \bigr \}. $$
(17)

We now show that the Goldstein approximate subdifferential contains all of the gradients of the active functions in the robust active set.

Lemma 3.8

Let f=max{f i :i=1,…,N}. Let Y=[y 0,y 1,y 2,…,y m] be a randomly sampled set from a ball centered at \(y^{0}=\bar{x}\) with radius Δ. If \(f_{i} \in\mathcal{C}^{1}\) for each i, then

$$\partial_\Delta^G f(\bar{x}) \supseteq\operatorname{conv} \bigl\{ \nabla f_i \bigl(y^j\bigr) : y^j \in Y, i \in A\bigl(y^j\bigr)\bigr\}. $$

Proof

If \(f_{i} \in\mathcal{C}^{1}\) for each iA(Y), then by (2), for each y jY we have

$$\partial f\bigl(y^j\bigr) = \operatorname{conv}\bigl\{ \nabla f_i \bigl(y^j\bigr) \bigr\}_{i \in A(y^j)} = \operatorname{conv} \bigl\{ \nabla f_i \bigl(y^j\bigr) : i \in A\bigl(y^j\bigr) \bigr\}. $$

Using this in our definition of the Goldstein approximate subdifferential in (17) and knowing \(B_{\Delta}(\bar{x}) \supseteq Y\), we have

$$\partial_\Delta^G f(\bar{x}) \supseteq\operatorname{conv} \bigl\{ \operatorname{conv}\bigl\{ \nabla f_i \bigl(y^j \bigr) : {i \in A\bigl(y^j\bigr)} \bigr\} : y^j \in Y \bigr\}, $$

which simplifies to

$$ \partial_\Delta^G f(\bar{x}) \supseteq \operatorname{conv}\bigl\{ \nabla f_i \bigl(y^j\bigr) : y^j \in Y, {i \in A\bigl(y^j\bigr)} \bigr\}. $$
(18)

 □

Now we have a result similar to Lemma 2.1(1) for \(d^{k}_{Y}\) with respect to the Goldstein approximate subdifferential.

Remark 3.9

For the following two results, we assume each of the \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant L. Note that this implies the Lipschitz constant L is independent of i. If each \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant L i , then L is easily obtained by L=max{L i :i=1,…,N}.

Lemma 3.10

Let f=max{f i :i=1,…,N} where \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant L. Let Y=[y 0,y 1,y 2,…,y m] be a randomly sampled set from a ball centered at \(y^{0}=\bar{x}\) with radius Δ. Suppose there exists a \(\bar{K} >0\) such that \(| \nabla_{A} f_{i} (\bar{x})- \nabla f_{i}(\bar{x})| \le\bar{K} \Delta \). Then for all \(w \in G_{Y}(\bar{x})\), there exists a \(g \in\partial_{\Delta}^{G} f(\bar{x})\) such that

$$| w-g |\le(\bar{K} + L)\Delta. $$

Proof

By definition, for all \(w \in G_{Y}(\bar{x})\) there exists a set of α i such that

$$w=\sum_{i \in A(Y)} \alpha_i \nabla_A f_i (\bar{x}),\quad\text{where } \alpha_i \ge0, \sum_{i \in A(Y)} \alpha_i = 1. $$

By our assumption that each \(f_{i} \in\mathcal{C}^{1+}\), Lemma 3.8 holds. It is clear that for each \(i \in A(Y)\), \(i \in A(y^{j})\) for some \(y^{j} \in Y\). Let \(j_i\) be the index corresponding to this active index; i.e., \(i \in A(y^{j_{i}})\). Thus, for each \(i \in A(Y)\), there is a corresponding active gradient

$$\nabla f_i \bigl(y^{j_i}\bigr) \in\operatorname{conv}\bigl \{ \nabla f_i\bigl(y^{j_i}\bigr): y^{j_i} \in Y, i \in A\bigl(y^{j_i}\bigr) \bigr\} \subseteq\partial^G_\Delta f(\bar{x}). $$

Using the same α i as above, we can construct

$$g = \sum_{i \in A(Y)} \alpha_i \nabla f_i \bigl(y^{j_i}\bigr) \in \operatorname{conv} \bigl\{ \nabla f_i \bigl(y^{j_i}\bigr) : y^{j_i} \in Y, {i \in A\bigl(y^{j_i}\bigr)} \bigr\} \subseteq\partial_\Delta^G f(\bar{x}). $$

Then

$$|w-g| = \biggl| \sum_{i \in A(Y)} \alpha_i \bigl( \nabla_A f_i (\bar{x}) - \nabla f_i\bigl(y^{j_i}\bigr) \bigr) \biggr| \le\sum_{i \in A(Y)} \alpha_i \bigl( \bigl| \nabla_A f_i (\bar{x}) - \nabla f_i(\bar{x}) \bigr| + \bigl| \nabla f_i(\bar{x}) - \nabla f_i\bigl(y^{j_i}\bigr) \bigr| \bigr) \le\sum_{i \in A(Y)} \alpha_i \bigl( \bar{K} \Delta+ L \bigl|\bar{x} - y^{j_i}\bigr| \bigr) \le(\bar{K} + L)\Delta. $$

 □

Thus, using Lemma 3.10, we can show that if the algorithm stops due to the robust stopping conditions, then the distance from 0 to the Goldstein approximate subdifferential is controlled by ε tol .

Proposition 3.11

Let f=max{f i :i=1,…,N} where each \(f_{i} \in~\mathcal {C}^{1+}\) with Lipschitz constant L. Suppose there exists a \(\bar{K} > 0\) such that for each iteration k, the approximate gradient satisfies \(|\nabla_{A} f_{i}(x^{k}) - \nabla f_{i}(x^{k})| \le\bar{K} \Delta ^{k}\) for all i=1,…,N. Suppose the RAGS algorithm terminates at some iteration \(\bar{k}\) in Step 2 using the robust stopping conditions given in (16). Then

$$\operatorname{dist}\bigl(0|\partial_{\Delta^{\bar{k}}}^G f \bigl(x^{\bar {k}}\bigr)\bigr)< \bigl[1+ \mu^0 (\bar{K} +L) \bigr] \varepsilon_{tol}. $$

Proof

Let \(\bar{w} = \operatorname{Proj}(0|G_{Y^{\bar{k}}}(x^{\bar{k}}))\). We use \(\bar{g} \in\partial^{G}_{\Delta^{\bar{k}}} f(x^{\bar{k}})\) as constructed in Lemma 3.10 to see that

$$\operatorname{dist}\bigl(0|\partial_{\Delta^{\bar{k}}}^G f\bigl(x^{\bar{k}}\bigr)\bigr) \le|\bar{g}| \le|\bar{g} - \bar{w}| + |\bar{w}| \le(\bar{K} + L) \Delta^{\bar{k}} + \bigl|d^{\bar{k}}_Y\bigr|. $$

The statement now follows by the test \(\Delta^{\bar{k}} \le \mu^{\bar{k}} |d^{\bar{k}}_{Y}|\) in Step 2 and the fact that \(\mu^{\bar{k}} \le\mu^{0}\) as {μ k} k=0 is a non-increasing sequence. □

4 Approximate gradients

As seen in the previous two sections, in order for convergence to be guaranteed in the AGS or RAGS algorithm, the approximate gradient used must satisfy an error bound for each of the active f i . Specifically, there must exist a \(\bar{K}>0\) such that

$$\bigl| \nabla_A f_i\bigl(x^k\bigr) - \nabla f_i\bigl(x^k\bigr)\bigr| \le\bar{K} \Delta^k, $$

where \(\Delta^k=\max_{j} |y^j - x^k|\). In this section, we present three specific approximate gradients that satisfy the above requirement: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function.

4.1 Simplex gradient

The simplex gradient is a commonly used approximate gradient. In recent years, several derivative-free algorithms have been proposed that use the simplex gradient ([5, 11, 12, 22] and [19] among others). It is geometrically defined as the gradient of the linear interpolation of f over a set of n+1 points in ℝn. Mathematically, we define it as follows.

Let Y=[y 0,y 1,…,y n] be a set of affinely independent points in ℝn. We say that Y forms the simplex \(S=\operatorname{conv}\{ Y\}\). Thus, S is a simplex if it can be written as the convex hull of an affinely independent set of n+1 points in ℝn.

The simplex gradient of a function f over the set Y is given by

$$\nabla_s f(Y) = L^{-1} \delta f(Y), $$

where

$$L= \left [ \begin{array}{c@{\quad}c@{\quad}c} y^1 - y^0 & \cdots & y^n-y^0 \end{array} \right ]^\top $$

and

$$\delta f(Y) = \left [ { \begin{array}{c} f(y^1) - f(y^0)\\ \vdots\\ f(y^n)-f(y^0) \end{array} } \right ]. $$

The condition number of the simplex formed by Y is given by \(\| \hat{L}^{-1}\|\), where

$$\hat{L} = \frac{1}{\Delta} \left[ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c} y^1-y^0 & y^2-y^0 & \cdots & y^n-y^0 \end{array} \right]^\top\quad \text{and where } \Delta= \max_{j=1, \ldots, n} \bigl|y^j - y^0\bigr|. $$
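A minimal NumPy sketch of the two quantities just defined (the function name simplex_gradient is our own); it returns both \(\nabla_s f(Y) = L^{-1}\delta f(Y)\) and the condition number \(\|\hat{L}^{-1}\|\).

```python
import numpy as np

def simplex_gradient(f, Y):
    """Y: (n+1, n) array [y^0; y^1; ...; y^n] of affinely independent points."""
    y0, rest = Y[0], Y[1:]
    L = rest - y0                                   # rows are (y^j - y^0)^T
    delta_f = np.array([f(y) - f(y0) for y in rest])
    grad_s = np.linalg.solve(L, delta_f)            # L^{-1} delta f(Y)
    Delta = np.max(np.linalg.norm(L, axis=1))       # max_j |y^j - y^0|
    cond = np.linalg.norm(np.linalg.inv(L / Delta), 2)   # ||Lhat^{-1}||
    return grad_s, cond
```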

4.1.1 Convergence

The following result (by Kelley [23]) shows that there exists an appropriate error bound between the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz constant used in the following theorem corresponds to ∇f i .

Theorem 4.1

Consider \(f_{i} \in\mathcal{C}^{1+}\) with Lipschitz constant K i forf i . Let Y=[y 0,y 1,…,y n] form a simplex. Let

$$\hat{L} = \frac{1}{\Delta} \left [ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c} y^1-y^0 & y^2-y^0 & \cdots & y^n-y^0 \end{array} \right ]^\top,\quad \text{\textit{where} } \Delta= \max_{j=1, \ldots, n} \bigl|y^j - y^0\bigr|. $$

Then the simplex gradient satisfies the error bound

$$\bigl| \nabla_s f_i(Y) - \nabla f_i \bigl(y^0\bigr)\bigr| \le\bar{K} \Delta, $$

where \(\bar{K} = \frac{1}{2} K_{i} \sqrt{n} \| \hat {L}^{-1} \|\).

Proof

See [23, Lemma 6.2.1]. □

With the above error bound result, we conclude that convergence holds when using the simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.

Corollary 4.2

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant K i forf i . If the approximate gradient used in the AGS or RAGS algorithm is the simplex gradient and \(\|\hat{L}^{-1}\|\) is bounded above for each simplex gradient computed, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.

4.1.2 Algorithm—simplex gradient

In order to calculate a simplex gradient in Step 1, we generate a set \(Y=[x^k, y^1, \ldots, y^n]\) of points in \(\mathbb{R}^n\) and then check to see if Y forms a well-poised simplex by calculating its condition number, \(\|\hat{L}^{-1}\|\). A bounded condition number (\(\| \hat{L}^{-1}\| < n\)) ensures a ‘good’ error bound between the approximate gradient and the exact gradient.

If the set Y does not form a well-poised simplex (\(\|\hat{L}^{-1}\| \ge n\)), then we resample. If Y forms a well-poised simplex, then we calculate the simplex gradient of f i over Y for each iA(x k) and then set the approximate subdifferential equal to the convex hull of the active simplex gradients. We note that the probability of generating a random matrix with a condition number greater than n is asymptotically constant [38]. Thus, randomly generating simplices is a quick and practical option. Furthermore, notice that calculating the condition number does not require function evaluations; thus, resampling does not affect the number of function evaluations required by the algorithm.

4.2 Centered simplex gradient

The centered simplex gradient is the average of two simplex gradients. Although it requires more function evaluations, it has the advantage that the error bound it satisfies is in terms of \(\Delta^2\), rather than \(\Delta\).

Let Y=[y 0,y 1,…,y n] form a simplex. We define the sets

$$Y^+ = \bigl[ x, x+\tilde{y}^1, \ldots, x+\tilde{y}^n \bigr] $$

and

$$Y^- = \bigl[ x, x-\tilde{y}^1, \ldots, x-\tilde{y}^n \bigr], $$

where x=y 0 and \(\tilde{y}^{i} = y^{i} - y^{0}\) for i=1,…,n. The centered simplex gradient is the average of the two simplex gradients over the sets Y + and Y , i.e.,

$$\nabla_{CS} f(Y) = \frac{1}{2} \bigl(\nabla_S f \bigl(Y^+\bigr) + \nabla_S f\bigl(Y^-\bigr)\bigr). $$
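A minimal sketch of this calculation (the helper name centered_simplex_gradient is our own). Since \(\nabla_S f(Y^-)\) is computed over the steps \(-\tilde{y}^i\), the average of the two simplex gradients reduces to a single linear solve with central differences.

```python
import numpy as np

def centered_simplex_gradient(f, Y):
    """Y: (n+1, n) array [y^0; y^1; ...; y^n] forming a simplex."""
    x = Y[0]
    offsets = Y[1:] - x                                       # tilde y^i = y^i - y^0
    d_plus = np.array([f(x + t) - f(x) for t in offsets])     # delta f(Y+)
    d_minus = np.array([f(x - t) - f(x) for t in offsets])    # delta f(Y-)
    # grad_S(Y+) = L^{-1} d_plus and grad_S(Y-) = -(L^{-1} d_minus),
    # so their average is L^{-1} (d_plus - d_minus) / 2.
    return np.linalg.solve(offsets, 0.5 * (d_plus - d_minus))
```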

4.2.1 Convergence

To show that the AGS and RAGS algorithms are both well-defined when using the centered simplex gradient as an approximate gradient, we provide an error bound between the centered simplex gradient and the exact gradient (again by Kelley [23]).

Theorem 4.3

Consider \(f_{i} \in\mathcal{C}^{2+}\) with Lipschitz constant K i for2 f i . Let Y=[y 0,y 1,…,y n] form a simplex. Let

$$\hat{L} = \frac{1}{\Delta} \bigl[y^1 - y^0, \ldots, y^n-y^0\bigr]\quad \text{\textit{where} } \Delta= \max _{j=1, \ldots, n} \bigl|y^j - y^0\bigr|. $$

Then the centered simplex gradient satisfies the error bound

$$\bigl|\nabla_{CS} f_i(Y) - \nabla f_i \bigl(y^0\bigr)\bigr| \le\bar{K} \Delta^2, $$

where \(\bar{K} = K_{i} \sqrt{n} \|\hat{L}^{-1}\|\).

Proof

See [23, Lemma 6.2.5]. □

Notice that Theorem 4.3 requires \(f_{i} \in\mathcal {C}^{2+}\). If \(f_{i} \in\mathcal{C}^{1+}\), then the error bound is in terms of Δ, not Δ2. With the above error bound result, we conclude that convergence holds when using the centered simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.

Corollary 4.4

Let f=max{f i :i=1,…,N} where each \(f_{i} \in\mathcal {C}^{2+}\) with Lipschitz constant K i for2 f i . If the approximate gradient used in the AGS or RAGS algorithm is the centered simplex gradient, \(\|\hat{L}^{-1}\|\) is bounded above for each centered simplex gradient computed and Δ0≤1, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.

Proof

Since \(\Delta^0\le1\) and the sequence \(\{\Delta^k\}\) is non-increasing, we have \((\Delta^k)^2\le\Delta^k\), and therefore Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold. □

4.2.2 Algorithm

In Step 1, to adapt the AGS or RAGS algorithm to use the centered simplex gradient, we sample our set Y in the same manner as for the simplex gradient (resampling until a well-poised set is achieved). We then form the sets Y + and Y and proceed as expected.

4.3 Gupal estimate

The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [24] uses the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient. We see in Theorem 4.8 that an appropriate error bound exists for this approximate gradient. Surprisingly, unlike the error bounds for the simplex and centered simplex gradients, the error bound in Theorem 4.8 does not include a condition number term. Mathematically, we define the Gupal estimate of the gradient of the Steklov averaged function as follows.

For α>0, the Steklov averaged function f α , as defined in [16, Def. 3.1], is given by

$$ f_\alpha(x)= \int_{\mathbb{R}^n} f(x-y) \psi_\alpha(y) dy, $$

where ψ α :ℝn→ℝ+ is the Steklov mollifier defined by

$$ \psi_\alpha(y) = \left \{ \begin{array}{l@{\quad}l} 1/\alpha^n & \text{if } y \in[-\alpha/2,\alpha/2]^n, \\ 0 & \text{otherwise}. \end{array} \right . $$

We can equivalently define the Steklov averaged function by

$$ f_\alpha(x)= \frac{1}{\alpha^n} \int^{x_1+\alpha/2}_{x_1-\alpha /2} \cdots\int^{x_n+\alpha/2}_{x_n-\alpha/2} f(y) dy_1 \cdots dy_n. $$
(19)

The partial derivatives of f α are given by ([16, Prop. 3.11] and [18])

$$ \frac{\partial f_\alpha}{\partial x_i} (x) = \int_{\mathbb {B}_\infty} \gamma_i(x, \alpha, \zeta) d\zeta $$
(20)

for i=1,…,n, where \(\mathbb{B}_{\infty}= [-1/2, 1/2]^{n}\) is the unit cube centred at 0 and

$$ \gamma_{i}(x,\alpha,\zeta) = \frac{1}{\alpha} \Bigl[ f\Bigl(x_1+\alpha\zeta_1, \ldots, x_{i-1}+\alpha\zeta_{i-1}, x_i+\tfrac{\alpha}{2}, x_{i+1}+\alpha\zeta_{i+1}, \ldots, x_n+\alpha\zeta_n\Bigr) - f\Bigl(x_1+\alpha\zeta_1, \ldots, x_{i-1}+\alpha\zeta_{i-1}, x_i-\tfrac{\alpha}{2}, x_{i+1}+\alpha\zeta_{i+1}, \ldots, x_n+\alpha\zeta_n\Bigr) \Bigr]. $$
(21)

Given α>0 and \(z=(\zeta^{1}, \ldots, \zeta^{n}) \in\prod^{n}_{i=1} \mathbb{B}_{\infty}\), the Gupal estimate of ∇f α (x) over z is given by

$$ \gamma({x},{\alpha},{z}) = \bigl(\gamma_{1}\bigl(x,\alpha,\zeta^1\bigr), \ldots, \gamma_{n}\bigl(x,\alpha, \zeta^n\bigr)\bigr). $$
(22)
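A minimal sketch of the estimate (22), using the coordinate-wise difference quotient that appears in the proof of Theorem 4.8 (and in our reconstruction of (21)); the function name gupal_estimate and the array layout of Z are our own illustrative choices.

```python
import numpy as np

def gupal_estimate(f, x, alpha, Z):
    """Gupal estimate gamma(x, alpha, z) of (22).
    Z: (n, n) array whose i-th row is zeta^i, drawn from the unit cube [-1/2, 1/2]^n."""
    n = x.size
    gamma = np.empty(n)
    for i in range(n):
        base = x + alpha * Z[i]
        y_plus, y_minus = base.copy(), base.copy()
        y_plus[i] = x[i] + 0.5 * alpha          # i-th coordinate perturbed by +alpha/2
        y_minus[i] = x[i] - 0.5 * alpha         # and by -alpha/2
        gamma[i] = (f(y_plus) - f(y_minus)) / alpha
    return gamma
```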

Remark 4.5

Although we define γ(x,α,z) as the Gupal estimate of ∇f α (x), in Sect. 4.3.1, we will show that γ(x,α,z) provides a good approximation to the exact gradient, ∇f(x).

Remark 4.6

For the following results, we note that the α used in the above definitions is equivalent to our sampling radius Δ. Thus, we will be replacing α with Δ in the convergence results in Sect. 4.3.1.

4.3.1 Convergence

As before, in order to show that both the AGS and RAGS algorithms are well-defined when using the Gupal estimate as an approximate gradient, we must establish that it provides a good approximate of our exact gradient. To do this, we first need the following classic result from [13].

Lemma 4.7

[13, Lemma 4.1.12]

Let \(f \in\mathcal{C}^{1+}\) with Lipschitz constant K for \(\nabla f\). Let \(y^0 \in \mathbb{R}^n\). Then for any \(y \in \mathbb{R}^n\),

$$\bigl| f(y) - f\bigl(y^0\bigr) - \nabla f\bigl(y^0 \bigr)^\top\bigl(y-y^0\bigr) \bigr| \le\frac {1}{2} K \bigl|y-y^0\bigr|^2. $$

Using Lemma 4.7, we establish an error bound between the Gupal estimate and the exact gradient of f.

Theorem 4.8

Consider \(f_{i} \in\mathcal{C}^{1+}\) with Lipschitz constant \(K_i\) for \(\nabla f_i\). Let ε>0. Then for Δ>0, \(z = ( \zeta ^{1}, \ldots, \zeta^{n} ) \in Z = \prod^{n}_{i=1} \mathbb{B}_{\infty}\) and any point \(x \in \mathbb{R}^n\), the Gupal estimate of \(\nabla f_i(x)\) satisfies the error bound

$$\bigl| \gamma(x,\Delta,z) - \nabla f_i(x) \bigr| \le\sqrt {n} { \frac{1}{2} K_i \Delta(\sqrt{n} +3)}. $$

Proof

For Δ>0, let

$$y^{j-} = \biggl[x_1 + \Delta\zeta_1, \ldots, x_{j-1}+ \Delta\zeta_{j-1}, x_j-\frac{1}{2} \Delta, x_{j+1} + \Delta\zeta_{j+1}, \ldots, x_n + \Delta\zeta_n\biggr]^\top $$

and

$$y^{j+}= \biggl[x_1 + \Delta\zeta_1, \ldots, x_{j-1}+ \Delta\zeta_{j-1}, x_j+\frac{1}{2} \Delta, x_{j+1} + \Delta\zeta_{j+1}, \ldots, x_n + \Delta\zeta_n\biggr]^\top. $$

Applying Lemma 4.7, we have that

$$ \bigl| f_i\bigl(y^{j+}\bigr) - f_i \bigl(y^{j-}\bigr) - \nabla f_i\bigl(y^{j-} \bigr)^\top \bigl(y^{j+}-y^{j-}\bigr) \bigr| \le \frac{1}{2} K_i \bigl|y^{j+}-y^{j-}\bigr|^2. $$
(23)

From (21) (with α=Δ), we can see that

$$f_i\bigl(y^{j+}\bigr)-f_i\bigl(y^{j-} \bigr) = \Delta\gamma_{j}\bigl(x,\Delta,\zeta^j \bigr). $$

Hence, (23) becomes

$$ \bigl| \Delta\gamma_{j}\bigl(x,\Delta, \zeta^j\bigr) - \nabla f_i\bigl(y^{j-} \bigr)^\top\bigl(y^{j+}-y^{j-}\bigr) \bigr| \le \frac{1}{2} K_i \bigl|y^{j+}-y^{j-}\bigr|^2. $$
(24)

From our definitions of \(y^{j-}\) and \(y^{j+}\), we can see that

$$y^{j+}-y^{j-} = [ 0, \ldots, 0, \Delta, 0, \ldots, 0]^\top. $$

The inner product in (24) simplifies to

$$\nabla f_i\bigl(y^{j-}\bigr)^\top \bigl(y^{j+}-y^{j-}\bigr) = \Delta\frac{\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr). $$

Thus, we have

$$\biggl| \Delta\gamma_{j}\bigl(x,\Delta,\zeta^j\bigr) - \Delta\frac{\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr) \biggr| \le\frac {1}{2} K_i \Delta^2, $$

implying

$$ \biggl| \gamma_{j}\bigl(x,\Delta,\zeta^j\bigr) - \frac {\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr) \biggr| \le \frac{1}{2} K_i \Delta. $$
(25)

Also notice that

$$\bigl|y^{j-}-x\bigr| = \Delta \biggl|\biggl( \zeta^j_1, \ldots, \zeta^j_{j-1}, -\frac{1}{2}, \zeta^j_{j+1}, \ldots, \zeta^j_n \biggr) \biggr|. $$

Using the standard basis vector e j, we have

$$\bigl|y^{j-}-x\bigr| = \Delta \bigl| \zeta^j - \zeta_j e^j - \frac{1}{2} e^j \bigr| \le \Delta \biggl( \bigl| \zeta^j \bigr| + \bigl| \zeta_j e^j\bigr| + \frac {1}{2} \biggr) \le { \frac{1}{2} \Delta(\sqrt{n} +2) }. $$

Thus, since \(f_{i} \in\mathcal{C}^{1+}\), we have

$$ \bigl| \nabla f_i \bigl(y^{j-}\bigr) - \nabla f_i (x) \bigr| \le K_i { \frac {1}{2} \Delta(\sqrt{n} +2) }. $$
(26)

Noting that

$$\biggl| { \frac{\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr) - \frac{\partial f_i}{\partial x_j} (x) } \biggr| \le \bigl| \nabla f_i \bigl(y^{j-}\bigr) - \nabla f_i (x) \bigr| , $$

we have

$$ \biggl| { \frac{\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr) - \frac{\partial f_i}{\partial x_j} (x) } \biggr| \le K_i \frac{1}{2} \Delta(\sqrt{n} +2) . $$
(27)

Using (25) and (27), we have

$$\biggl| \gamma_{j}\bigl(x,\Delta,\zeta^j\bigr) - \frac{\partial f_i}{\partial x_j} (x) \biggr| \le \biggl| \gamma_{j}\bigl(x,\Delta,\zeta^j\bigr) - \frac{\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr) \biggr| + \biggl| \frac{\partial f_i}{\partial x_j} \bigl(y^{j-}\bigr) - \frac{\partial f_i}{\partial x_j} (x) \biggr| \le \frac{1}{2} K_i \Delta(\sqrt{n} +3). $$

Finally,

$$\bigl| \gamma(x,\Delta,z) - \nabla f_i(x) \bigr| = \Biggl( \sum^{n}_{j=1} \biggl| \gamma_{j}\bigl(x,\Delta,\zeta^j\bigr) - \frac{\partial f_i}{\partial x_j} (x) \biggr|^2 \Biggr)^{1/2} \le \sqrt{n} { \frac{1}{2} K_i \Delta(\sqrt{n} +3)}. $$

 □
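As an informal numerical check (ours, not part of the paper's experiments), the sketch below compares the Gupal estimate with the exact gradient of a smooth quadratic and with the bound of Theorem 4.8; it reuses the hypothetical gupal_estimate helper sketched after Remark 4.6.

```python
import numpy as np

# f_i(y) = 0.5 * y^T A y has gradient A y, whose Lipschitz constant is ||A||_2.
rng = np.random.default_rng(1)
n, delta = 5, 1e-2
B = rng.standard_normal((n, n))
A = B + B.T                                    # symmetric Hessian
K_i = np.linalg.norm(A, 2)
fi = lambda y: 0.5 * y @ A @ y

x = rng.standard_normal(n)
Z = rng.uniform(-0.5, 0.5, size=(n, n))        # z = (zeta^1, ..., zeta^n)
err = np.linalg.norm(gupal_estimate(fi, x, delta, Z) - A @ x)
bound = np.sqrt(n) * 0.5 * K_i * delta * (np.sqrt(n) + 3)
print(err <= bound)                            # the observed error respects the bound
```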

We conclude that convergence holds when using the Gupal estimate of the gradient of the Steklov averaged function of f as an approximate gradient in both the AGS and RAGS algorithms.

Corollary 4.9

Let \(f=\max\{f_i : i=1,\ldots,N\}\) where each \(f_{i} \in\mathcal {C}^{1+}\) with Lipschitz constant \(K_i\) for \(\nabla f_i\). If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate of the gradient of the Steklov averaged function, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.

4.3.2 Algorithm

To use the Gupal estimate of the gradient of the Steklov averaged function in the AGS and RAGS algorithms, in Step 1 we sample \(\{z^{kl}\}^{m}_{l=1}\) independently and uniformly from the unit cube in \(\mathbb{R}^{n \times n}\), where m is the number of active functions.
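For instance, the sampling in Step 1 could be carried out as in the following sketch (our own variable names; the values of m and n come from the current iterate):

```python
import numpy as np

n, m = 4, 3                 # problem dimension and number of active functions (example values)
rng = np.random.default_rng()
# One n-by-n matrix z^{kl} per active function; each row is a zeta^i in [-1/2, 1/2]^n.
Z_samples = rng.uniform(-0.5, 0.5, size=(m, n, n))
# Z_samples[l] is then used to form the Gupal estimate for the l-th active sub-function.
```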

5 Numerical results

5.1 Versions of the AGS and RAGS algorithms

We implemented the AGS and RAGS algorithms using the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as approximate gradients.

Additionally, we used the robust descent direction to create robust stopping conditions. That is, the algorithm terminates when

$$ \Delta^k \le\mu^k \bigl|d^k_Y\bigr| \quad\mbox{and} \quad \bigl|d^k_Y\bigr| < \varepsilon_{tol}, $$
(28)

where \(d^{k}_{Y}\) is the projection of 0 onto the approximate subdifferential generated using the robust active set (see Proposition 3.11 for results linking the robust stopping conditions with the Goldstein approximate subdifferential). The implementation was done in MATLAB (v. 7.11.0.584, R2010b). Software is available by contacting the corresponding author.
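In code, the robust stopping test (28) is a simple conjunction; the sketch below uses our own naming and assumes \(d^k_Y\) has already been computed by projecting 0 onto the robust approximate subdifferential.

```python
import numpy as np

def robust_stopping(delta_k, mu_k, d_Y, eps_tol):
    """Robust stopping test (28): terminate only if both conditions hold."""
    norm_dY = np.linalg.norm(d_Y)
    return delta_k <= mu_k * norm_dY and norm_dY < eps_tol

# e.g. a small radius alone is not enough to stop while |d_Y| is still above eps_tol
print(robust_stopping(delta_k=1e-8, mu_k=1e-3, d_Y=np.array([1e-2, 0.0]), eps_tol=1e-6))  # False
```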

Let d k denote the regular descent direction and let \(d^{k}_{Y}\) denote the robust descent direction. There are three scenarios that could occur when using the robust stopping conditions:

  1. \(|d^{k}| = |d^{k}_{Y}|\);

  2. \(|d^{k}| \ge|d^{k}_{Y}|\), but checking the stopping conditions leads to the same result (line search, radius decrease or termination); or

  3. \(|d^{k}| \ge|d^{k}_{Y}|\), but checking the stopping conditions leads to a different result.

In Scenarios 1 and 2, the robust stopping conditions have no influence on the algorithm. In Scenario 3, we have two cases:

  1. \(\Delta^{k} \le\mu^{k} |d^{k}_{Y}| \le\mu^{k} |d^{k}|\), but \(|d^{k}| \ge \varepsilon_{tol}\) and \(|d^{k}_{Y}| < \varepsilon_{tol}\); or

  2. \(\Delta^{k} \le \mu^{k} |d^{k}|\) holds, but \(\Delta^{k} > \mu^{k} |d^{k}_{Y}|\).

Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS algorithms to do one of two things: terminate early, providing a solution of lower measured quality but requiring fewer function evaluations to find, or reduce the sampling radius instead of carrying out a line search, which reduces the number of function evaluations in that iteration and yields a more accurate approximate subdifferential at the next iteration.

Our goal in this testing is to determine if there are any notable numerical differences in the quality of the three approximate gradients (simplex, centered simplex, and Gupal estimate), the two search directions (robust and non-robust), and the two stopping conditions (robust and non-robust). This leads to the following 12 versions:

  • AGS Simplex (1. non-robust/2. robust stopping)

  • RAGS Simplex (3. non-robust/4. robust stopping)

  • AGS Centered Simplex (5. non-robust/6. robust stopping)

  • RAGS Centered Simplex (7. non-robust/8. robust stopping)

  • AGS Gupal (9. non-robust/10. robust stopping)

  • RAGS Gupal (11. non-robust/12. robust stopping)

5.2 Test sets and software

Testing was performed on a 2.0 GHz Intel Core i7 MacBook Pro. We used the test set from Lukšan-Vlček [27]. The first 25 problems presented are of the desired form

$$\min_x \bigl\{ F(x) \bigr\}\quad \text{where } F(x) = \max _{i=1,2,\ldots, N } \bigl\{ f_i (x) \bigr\}. $$

Of these 25 problems, we omit problem 2.17 because its sub-functions are complex-valued. Thus, our test set presents a total of 24 finite minimax problems with dimensions from 2 to 20. Several problems contain sub-functions of the form \(f_i = |\hat{f}_i|\), where \(\hat{f}_i\) is a smooth function; we rewrote these as \(f_i = \max\{\hat{f}_i, -\hat{f}_i\}\). The resulting test problems have from 2 to 130 sub-functions. A summary of the test problems appears in Table 2 in the Appendix.

5.3 Initialization and stopping conditions

We first describe our choices for the initialization parameters used in the AGS and RAGS algorithms.

The initial starting points are given for each problem in [27]. We set the initial accuracy measure to 0.5 with a reduction factor of 0.5. We set the initial sampling radius to 0.1 with a reduction factor of 0.5. The Armijo-like parameter η was chosen to be 0.1 to ensure that a line search success resulted in a significant function value decrease. We set the minimum step length to \(10^{-10}\).
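Collected in one place, the parameter choices above correspond to a configuration like the following (field names are ours and do not reflect the authors' code):

```python
# Initialization parameters used for the AGS/RAGS runs (names are ours).
params = {
    "mu0": 0.5,           # initial accuracy measure
    "mu_reduce": 0.5,     # accuracy measure reduction factor
    "delta0": 0.1,        # initial sampling radius
    "delta_reduce": 0.5,  # sampling radius reduction factor
    "eta": 0.1,           # Armijo-like line search parameter
    "min_step": 1e-10,    # minimum step length
}
```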

Next, we discuss the stopping tolerances used to ensure finite termination of the AGS and RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The first is our theoretical stopping condition, while the remaining three are to ensure numerical stability of the algorithm.

  1. Stopping conditions met: As stated in the theoretical algorithm, the algorithm terminates for this reason when \(\Delta^k \le \mu^k |d^k|\) and \(|d^k| < \varepsilon_{tol}\), where \(d^k\) is either the regular or the robust descent direction.

  2. Hessian matrix has NaN/Inf entries: For the solution of the quadratic program in Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations. When these limitations result in NaN or Inf entries in the Hessian, the algorithm terminates.

  3. \(\Delta^k\), \(\mu^k\) and \(|d^k|\) are small: This stopping criterion bypasses the test \(\Delta^k \le \mu^k |d^k|\) (in Step 2) and stops if \(\Delta^k < \Delta_{tol}\), \(|\mu^k| < \mu_{tol}\) and \(|d^k| < \varepsilon_{tol}\). Examining Theorem 2.9 along with Theorems 4.1, 4.3 and 4.8, it is clear that this is also a valid stopping criterion. We used a bound of \(10^{-6}\) in our implementation for both \(\Delta^k\) and \(\mu^k\).

  4. Max number of function evaluations reached: As a final failsafe, we added an upper bound of \(10^6\) on the number of function evaluations allowed. (This stopping condition occurs only once in our results.)

5.4 Results

Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each version. For each of the 25 trials, we record the number of function evaluations, the number of iterations, the solution, the quality of the solution and the reason for termination. The quality was measured by the improvement in the number of digits of accuracy, which is calculated using the formula

$$-\log \biggl( \frac{| F_{\min}-F^* |}{| F_0 -F^* |} \biggr), $$

where \(F_{\min}\) is the function value at the final (best) iterate, \(F^*\) is the true minimum (optimal) value of the problem (as given in [27]) and \(F_0\) is the function value at the initial iterate. Results on function evaluations and solution quality appear in Tables 3, 4 and 5 of the Appendix.
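For example, assuming the logarithm is taken base 10 (consistent with counting digits of accuracy), the measure can be computed as follows; the helper name is ours.

```python
import numpy as np

def digits_of_accuracy(F_min, F_0, F_star):
    """Improvement in digits of accuracy of the final value over the initial value."""
    return -np.log10(abs(F_min - F_star) / abs(F_0 - F_star))

# Reducing the error from |F_0 - F*| = 1 to |F_min - F*| = 1e-3 gives 3 digits
print(digits_of_accuracy(F_min=1.001, F_0=2.0, F_star=1.0))   # ≈ 3.0
```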

To visually compare algorithmic versions, we use performance profiles. A performance profile is the (cumulative) distribution function for a performance metric [14]. For the AGS and RAGS algorithms, the performance metric for each test problem is the ratio of the number of function evaluations taken by a given version to solve that problem successfully to the least number of function evaluations taken by any version to do so. Performance profiles eliminate the need to discard failures in numerical results and provide a visual representation of the performance differences between several solvers. For full details on the construction of performance profiles, see [14].
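The following sketch (our own illustration of the construction in [14], not the authors' plotting code) computes such a profile from a table of function-evaluation counts, with np.inf marking runs that failed to reach the required accuracy improvement.

```python
import numpy as np
import matplotlib.pyplot as plt

def performance_profile(evals, taus):
    """Fraction of problems each solver solves within a factor tau of the best.

    evals: (n_problems, n_solvers) array of function-evaluation counts,
           with np.inf marking a failure to reach the required accuracy.
    """
    best = np.min(evals, axis=1, keepdims=True)        # best count per problem
    ratios = evals / best                              # performance ratios
    return np.array([[np.mean(ratios[:, s] <= tau) for s in range(evals.shape[1])]
                     for tau in taus])

# Example with 3 problems and 2 solver versions (second version fails problem 2)
evals = np.array([[120.0, 150.0], [300.0, np.inf], [80.0, 60.0]])
taus = np.linspace(1.0, 5.0, 100)
plt.step(taus, performance_profile(evals, taus))       # one curve per version
plt.xlabel("performance ratio")
plt.ylabel("fraction of problems solved")
```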

In Figs. 1(a) and 1(b) we include a performance profile showing all 12 versions of the AGS and RAGS algorithms tested, declaring a success for 1 digit and 3 digits of improvement, respectively.

Fig. 1 Performance profiles for 12 versions of the AGS/RAGS algorithm

In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient does not produce the best results. It produces 1 or more digits of accuracy only for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is no significant difference between the performance of the AGS and RAGS algorithms using the simplex and centered simplex gradients as approximate gradients.

Looking at the results in Tables 3, 4 and 5, and our performance profiles, we can make the following two observations:

  (i) the versions of the RAGS algorithm generally outperform (converge faster than) the versions of the AGS algorithm, and

  (ii) the RAGS algorithm using the robust stopping conditions terminates faster and with lower (but still significant) accuracy.

Robust active set: From our results, it is clear that expanding the active set to include ‘almost active’ functions in the RAGS algorithm greatly improves performance for the simplex and centered simplex gradient versions. This robust active set brings more local information into the approximate subdifferential and thereby allows for descent directions that run more nearly parallel to any nondifferentiable ridges formed by the function.

Robust stopping conditions: We notice from the performance profiles that, in terms of function evaluations, the robust stopping conditions improve the overall performance of the RAGS algorithm, although they decrease the average accuracy on some problems. These results agree with the hypothesis discussed above. Furthermore, upon studying the reasons for termination, it appears that the non-robust stopping conditions cause the AGS and RAGS algorithms to terminate mainly because \(\Delta^k\) and \(\mu^k\) become too small, whereas with the robust stopping conditions the RAGS algorithm often terminated because the stopping conditions were satisfied. As our theory in Sect. 3.3 is not complete, we cannot make any theoretical statements about how the robust stopping conditions would perform in general (like those in Theorem 2.9). However, from our results, we conjecture that the alteration is beneficial for decreasing function evaluations.

In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the RAGS algorithm either matches or outperforms the AGS algorithm in the average accuracy obtained over 25 trials using the simplex and centered simplex gradients. From this, we conclude that the improvement in accuracy is due to the choice of a robust search direction.

5.5 Comparison to NOMAD

NOMAD is a well-established general DFO solver that is not specialized to the class of minimax problems [1, 25]. Using the same set of 24 test problems as in our previous tests, we compare the RAGS algorithm to the NOMAD algorithm. Based on the previous tests, we use RAGS with the simplex gradient and robust stopping conditions. All parameters for the NOMAD algorithm were left at their default settings. The performance profile generated by these two algorithms for the improvement of 3 digits of accuracy appears in Fig. 2.

Fig. 2 Performance profile comparing RAGS with NOMAD for an improvement of 3 digits of accuracy

As seen in Fig. 2, NOMAD appears slightly faster than RAGS, but slightly less robust. Overall, the two algorithms are fairly comparable. Further research into RAGS may lead to algorithmic improvement. For example, a more informed selection of the sample set Y in Step 1, an improved line search, or a better selection of default parameters for RAGS could all lead to improved convergence. We leave such research for future study.

6 Conclusion

We have presented a new derivative-free algorithm for finite minimax problems that exploits the smooth substructure of the problem. Convergence results are given for any arbitrary approximate gradient that satisfies an error bound dependent on the sampling radius. Three examples of such approximate gradients are given. Additionally, a robust version of the algorithm is presented and shown to have the same convergence results as the regular version.

Through numerical testing, we find that the robust version of the algorithm outperforms the regular version with respect to both the number of function evaluations required and the accuracy of the solutions obtained. Additionally, we tested robust stopping conditions and found that they generally required fewer function evaluations before termination. Overall, the robust stopping conditions paired with the robust version of the algorithm performed best (as seen in the performance profiles).

Considerable future work remains in this research direction. Most obvious is an exploration of the theory behind the performance of the robust stopping conditions. Another direction lies in the theoretical requirement that the step length be bounded away from 0 (see Theorems 2.8 and 3.5). In gradient-based methods, one common way to avoid this requirement is through Wolfe-like conditions. We are unaware of any derivative-free variant of the Wolfe conditions.