26.1 Introduction

We consider optimization programs of the form

$$\displaystyle\begin{array}{rcl} \min _{x\in {\mathbb{R}}^{n}}f(x),& &{}\end{array}$$
(26.1)

where \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) is locally Lipschitz but neither differentiable nor convex. We present a bundle algorithm which converges to a critical point of (26.1) if exact function values and subgradients of f are provided, and to an approximate critical point if subgradients or function values are inexact. Here \(\bar{x} \in {\mathbb{R}}^{n}\) is an approximate critical point if

$$\displaystyle\begin{array}{rcl} \text{ dist}\left (0,\partial f(\bar{x})\right ) \leq \varepsilon,& &{}\end{array}$$
(26.2)

where ∂ f(x) is the Clarke subdifferential of f at x.

The method discussed here extends the classical bundle concept to the nonconvex setting by using downshifted tangents as a substitute for cutting planes. This idea was already used in the 1980s in Lemaréchal’s M2FC1 code [32] or in Zowe’s BT codes [48, 54]. Its convergence properties can be assessed by the model-based bundle techniques [6, 7, 40, 42]. Recent numerical experiments using the downshift mechanism are reported in [8, 19, 50]. In the original paper of Schramm and Zowe [48] downshift is discussed for a hybrid method combining bundling, trust region, and line-search elements.

For convex programs (26.1) bundle methods which can deal with inexact function values or subgradients have been discussed at least since 1985; see Kiwiel [26, 28]. More recently, the topic has been revived by Hintermüller [22], who presented a method with exact function values but inexact subgradients \(g \in \partial _{\varepsilon }f(x)\), where ɛ remains unknown to the user. Kiwiel [30] expands on this idea and presents an algorithm which deals with inexact function values and subgradients, both with unknown error bounds. Kiwiel and Lemaréchal [31] extend the idea further to address column generation. Incremental methods to address large problems in stochastic programming or Lagrangian relaxation can be interpreted in the framework of inexact values and subgradients; see, e.g., Emiel and Sagastizábal [15, 16] and Kiwiel [29]. In [39] Nedic and Bertsekas consider approximate functions and subgradients which are in addition affected by deterministic noise.

Nonsmooth methods without convexity have been considered by Wolfe [52], Shor [49], Mifflin [38], Schramm and Zowe [48], and more recently by Lukšan and Vlček [35], Noll and Apkarian [41], Fuduli et al. [17, 18], Apkarian et al. [6], Noll et al. [42], Hare and Sagastizábal [21], Sagastizábal [47], Lewis and Wright [33], and Noll [40]. In the context of control applications, early contributions are Polak and Wardi [44], Mayne and Polak [36, 37], Kiwiel [27], Polak [43], Apkarian et al. [17], and Bompart et al. [9]. All these approaches use exact knowledge of function values and subgradients.

The structure of the paper is as follows. In Sect. 26.2 we explain the concept of an approximate subgradient. Section 26.3 discusses the elements of the algorithm: acceptance, tangent program, aggregation, cutting planes, recycling, and the management of proximity control. Section 26.4 presents the algorithm. Section 26.5 analyzes the inner loop in the case of exact function values and inexact subgradients. Section 26.6 gives convergence of the outer loop. Section 26.7 extends to the case where function values are also inexact. Section 26.8 uses the convergence theory of Sects. 26.5–26.7 to derive a practical stopping test. Section 26.9 concludes with a motivating example from control.

26.2 Preparation

Approximate subgradients in convex bundle methods refer to the ɛ-subdifferential [24]:

$$\displaystyle\begin{array}{rcl} \partial _{\varepsilon }f(x) =\{ g \in {\mathbb{R}}^{n}: {g}^{\top }(y - x) \leq f(y) - f(x) +\varepsilon \mbox{ for all }y \in {\mathbb{R}}^{n}\},& &{}\end{array}$$
(26.3)

whose central property is that \(0 \in \partial _{\varepsilon }f(\bar{x})\) implies ɛ-minimality of \(\bar{x}\), i.e., \(f(\bar{x}) \leq \min f+\varepsilon\). Without convexity we cannot expect a tool with similar global properties. We shall work with the following very natural approximate subdifferential

$$\displaystyle\begin{array}{rcl} \partial _{[\varepsilon ]}f(x) = \partial f(x) +\varepsilon B,& &{}\end{array}$$
(26.4)

where B is the unit ball in some fixed Euclidean norm and ∂ f(x) is the Clarke subdifferential of f. The present section motivates this choice.

The first observation concerns the optimality condition (26.2) arising from the choice (26.4). Namely, \(0 \in \partial _{[\varepsilon ]}f(\bar{x})\) can also be written as \(0 \in \partial (f +\varepsilon \| \cdot -\bar{x}\|)(\bar{x})\), meaning that a small perturbation of f is critical at \(\bar{x}\).
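
As a simple illustration (ours, assuming the Euclidean norm on the real line), take f(x) = |x|. Then

$$\displaystyle{\partial _{[\varepsilon ]}f(\bar{x}) = \left \{\begin{array}{ll} [-1-\varepsilon,\,1+\varepsilon ] &\mbox{ if }\bar{x} = 0, \\ \,[\mathrm{sgn}(\bar{x}) -\varepsilon,\,\mathrm{sgn}(\bar{x}) +\varepsilon ]&\mbox{ if }\bar{x}\neq 0, \end{array} \right.}$$

so for ɛ < 1 condition (26.2) holds only at the genuine critical point \(\bar{x} = 0\), while for ɛ ≥ 1 every point is approximate critical.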

We can also derive a weak form of ɛ-optimality from \(0 \in \partial _{[\varepsilon ]}f(x)\) for composite functions f = g ∘ F with g convex and F smooth or, more generally, for lower C 2 functions (see [45]) which have such a representation locally.

Lemma 26.1.

Let f = g ∘ F where g is convex and F is of class C 2 , and suppose \(0 \in \partial _{[\varepsilon ]}f(x)\) . Fix r > 0, and define

$$\displaystyle{c_{r}\,:=\,\max _{\|d\|=1}\;\max _{\|x{^\prime}-x\|\leq r}\;\max _{\phi \in \partial g(F(x))}{\phi }^{\top }{D}^{2}F(x{^\prime})[d,d].}$$

Then x is \((r\varepsilon + {r}^{2}c_{r}/2)\) -optimal on the ball B(x,r).

Proof.

We have to prove \(f(x) \leq f({x}^{+}) + r\varepsilon + {r}^{2}c_{r}/2\) for every \({x}^{+} \in B(x,r)\). Write \({x}^{+} = x + td\) for some \(\|d\| = 1\) and \(t \leq r\). Since \(0 \in \partial _{[\varepsilon ]}f(x)\), and since \(\partial f(x) = DF{(x)}^{{\ast}}\partial g(F(x))\), there exists \(\phi \in \partial g(F(x))\) such that \(\|DF{(x)}^{{\ast}}\phi \|\leq \varepsilon\). In other words, \(\vert {\phi }^{\top }DF(x)d\vert \leq \varepsilon\) because \(\|d\| = 1\). By the subgradient inequality we have

$$ \displaystyle\begin{array}{rcl} {\phi }^{\top }\left (F(x + td) - F(x)\right ) \leq g(F(x + td)) - g(F(x)) = f({x}^{+}) - f(x).& &{}\end{array}$$
(26.5)

Second-order Taylor expansion of \(t\mapsto {\phi }^{\top }F(x + td)\) at t = 0 gives

$$\displaystyle{{\phi }^{\top }F(x + td) = {\phi }^{\top }F(x) + t\,{\phi }^{\top }DF(x)d + \frac{{t}^{2}} {2}\,{\phi }^{\top }{D}^{2}F(x_{t})[d,d]}$$

for some x t on the segment [x,x + td]. Substituting this into (26.5) and using the definition of c r give

$$\displaystyle{f(x) \leq f({x}^{+}) + t\,\vert {\phi }^{\top }DF(x)d\vert + \frac{{t}^{2}} {2}\,\vert {\phi }^{\top }{D}^{2}F(x_{t})[d,d]\vert \leq f({x}^{+}) + r\varepsilon + \frac{{r}^{2}} {2} c_{r},}$$

hence the claim.■

Remark 26.2.

For convex f we can try to relate the two approximate subdifferentials in the sense that

$$\displaystyle{\partial _{\varepsilon }f(x) \subset \partial _{[\varepsilon {^\prime}]}f(x)}$$

for a suitable ɛ′ =ɛ′(x,ɛ). For a convex quadratic function \(f(x) = \frac{1} {2}{x}^{\top }Qx + {q}^{\top }x\) it is known that \(\partial _{\varepsilon }f(x) =\{ \nabla f(x) + {Q}^{1/2}z: \frac{1} {2}\|{z\|}^{2} \leq \varepsilon \}\), [24], so that \(\partial _{\varepsilon }f(x) \subset \partial f(x) +\varepsilon {^\prime}B = \partial _{[\varepsilon {^\prime}]}f(x)\) for \(\varepsilon {^\prime} =\sup \{\| {Q}^{1/2}z\|: \frac{1} {2}\|{z\|}^{2} \leq \varepsilon \}\), which means that ɛ′(x,ɛ) is independent of x and behaves as \(\varepsilon {^\prime} = \mathcal{O}({\varepsilon }^{1/2})\). We expect this type of relation to hold as soon as f has curvature information around x. On the other hand, if f(x) = |x|, then \(\partial _{\varepsilon }f(x) = \partial f(x) + \frac{\varepsilon } {\vert x\vert }B\) for x≠0 (and \(\partial _{\varepsilon }f(0) = \partial f(0)\)), which means that the relationship \(\varepsilon {^\prime} =\varepsilon /\vert x\vert \) is now linear in ɛ for fixed x≠0. In general it is difficult to relate ɛ to ɛ′. See Hiriart-Urruty and Seeger [23] for more information on this question.

Remark 26.3.

For composite functions f = g ∘ F with g convex and F of class C 1 we can introduce

$$\displaystyle{\partial _{\varepsilon }f(x) = DF{(x)}^{{\ast}}\partial _{\varepsilon }g(F(x)),}$$

where \(\partial _{\varepsilon }g(y)\) is the usual convex ɛ-subdifferential (26.3) of g and \(DF{(x)}^{{\ast}}\) is the adjoint of the differential of F at x. Since the corresponding chain rule is valid in the case of an affine F, \(\partial _{\varepsilon }f(x)\) is consistent with (26.3). Without convexity \(\partial _{\varepsilon }f(x)\) no longer preserves the global properties of (26.3). Yet, for composite functions f = g ∘ F, a slightly more general version of Lemma 26.1 combining \(\partial _{[\sigma ]}f\) and \(\partial _{\varepsilon }f\) can be proved along the lines of [41, Lemma 2]. In that reference the result is shown for the particular case g =λ 1, but an extension can be obtained by reasoning as in Lemma 26.1.

Remark 26.4.

For convex f the set \(\partial _{[\varepsilon ]}f(x)\) coincides with the Fréchet ɛ-subdifferential \(\partial _{\varepsilon }^{F}f(x)\). According to [34, Corollary 3.2] the same remains true for approximate convex functions. For the latter see Sect. 26.5.

26.3 Elements of the Algorithm

26.3.1 Local Model

Let x be the current iterate of the outer loop. The inner loop with counter k generates a sequence y k of trial steps, one of which is eventually accepted to become the new serious step x +. At each instant k we have at our disposal a convex working model ϕ k ( ⋅,x), which approximates f in a neighborhood of x. We suppose that we know at least one approximate subgradient \(g(x) \in \partial _{[\varepsilon ]}f(x)\). The affine function

$$\displaystyle{m_{0}(\cdot,x) = f(x) + g{(x)}^{\top }(\cdot - x)}$$

will be referred to as the exactness plane at x. For the moment we assume that it gives an exact value of f at x, but not an exact subgradient. The algorithm assures \(\phi _{k}(\cdot,x) \geq m_{0}(\cdot,x)\) at all times k, so that \(g(x) \in \partial \phi _{k}(x,x)\) for all k. In fact we construct ϕ k ( ⋅,x) in such a way that \(\partial \phi _{k}(x,x) \subset \partial _{[\varepsilon ]}f(x)\) at all times k.

Along with the first-order working model ϕ k ( ⋅,x) we also consider an associated second-order model of the form

$$\displaystyle{\Phi _{k}(y,x) =\phi _{k}(y,x) + \frac{1} {2}{(y - x)}^{\top }Q(x)(y - x),}$$

where Q(x) depends on the serious iterate x, but is fixed during the inner loop k. We allow Q(x) to be indefinite.

26.3.2 Cutting Planes

Suppose y k is a null step. Then model Φ k ( ⋅,x) which gave rise to y k was not rich enough and we have to improve it at the next inner loop step k + 1 in order to perform better. We do this by modifying the first-order part. In convex bundling one includes a cutting plane at y k into the new model ϕ k+1( ⋅,x). This remains the same with approximate subgradients and values (cf. [22, 30]) as soon as the concept of cutting plane is suitably modified. Notice that we have access to \(g_{k} \in \partial _{[\varepsilon ]}f({y}^{k})\), which gives us an approximate tangent

$$\displaystyle{t_{k}(\cdot ) = f({y}^{k}) + g_{ k}^{\top }(\cdot - {y}^{k})}$$

at y k. Since f is not convex, we cannot use t k ( ⋅) directly as cutting plane. Instead we use a technique originally developed in Schramm and Zowe [48] and Lemaréchal [32], which consists in shifting t k ( ⋅) downwards until it becomes useful for ϕ k+1( ⋅,x). Fixing c > 0 once and for all, we call

$$\displaystyle\begin{array}{rcl} s_{k}\,:=\,\left [t_{k}(x) - f(x)\right ]_{+} + c\|{y}^{k} - {x\|}^{2}& &{}\end{array}$$
(26.6)

the downshift and introduce

$$\displaystyle{m_{k}(\cdot,x) = t_{k}(\cdot ) - s_{k},}$$

called the downshifted tangent.

We sometimes use the following more explicit notation, where no reference to the counter k is made. The approximate tangent is \(t_{y,g}(\cdot ) = f(y) + {g}^{\top }(\cdot - y)\), bearing a reference to the point y where it is taken and to the specific approximate subgradient \(g \in \partial _{[\varepsilon ]}f(y)\). The downshifted tangent is then \(m_{y,g}(\cdot,x) = t_{y,g}(\cdot ) - s\), where \(s = s(y,g,x) = \left [t_{y,g}(x) - f(x)\right ]_{+} + c\|y - {x\|}^{2}\) is the downshift. Since this notation is fairly heavy, we will try to avoid it whenever possible and switch to the former, bearing in mind that t k ( ⋅) depends both on y k and the subgradient \(g_{k} \in \partial _{[\varepsilon ]}f({y}^{k})\). Similarly, the downshifted tangent plane m k ( ⋅,x) depends on y k, g k, and on x, as does the downshift s k . We use m k ( ⋅,x) as a substitute for the classical cutting plane. For convenience we continue to call m k ( ⋅,x) a cutting plane.

The cutting plane satisfies \(m_{k}(x,x) \leq f(x) - c\|{y}^{k} - {x\|}^{2}\), which assures that it does not interfere with the subdifferential of ϕ k+1( ⋅,x) at x. We build ϕ k+1( ⋅,x) in such a way that it has m k ( ⋅,x) as an affine minorant.

Proposition 26.5.

Let \(\phi _{k+1}(\cdot,x) =\max \{ m_{\nu }(\cdot,x):\nu = 0,\ldots,k\}\) . Then \(\partial \phi _{k+1}(x,x) \subset \partial _{[\varepsilon ]}f(x)\) .

Proof.

As all the downshifts s k are positive, \(\phi _{k+1}(y,x) = m_{0}(y,x)\) in a neighborhood of x; hence \(\partial \phi _{k+1}(x,x) = \partial m_{0}(x,x) =\{ g(x)\} \subset \partial _{[\varepsilon ]}f(x)\). ■
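
The downshift mechanism is easy to put into code. The following minimal Python sketch (our own illustration; the constant c, the test function and all variable names are assumptions, not part of the text) stores each plane as a pair (value at x, gradient) and reproduces (26.6) together with the max-type working model of Proposition 26.5.

```python
import numpy as np

c = 0.1  # downshift constant c > 0 from (26.6), value chosen for illustration

def downshifted_tangent(f, y, g, x, fx):
    """Cutting plane m(., x) = a + g^T(. - x): the approximate tangent
    t(.) = f(y) + g^T(. - y) at y, shifted down by s from (26.6)."""
    t_at_x = f(y) + g @ (x - y)                           # tangent value at x
    s = max(t_at_x - fx, 0.0) + c * np.dot(y - x, y - x)  # downshift s_k
    return t_at_x - s, g                                  # (value at x, gradient)

def working_model(planes, y, x):
    """phi_k(y, x): max over the stored planes (a, g) of a + g^T(y - x)."""
    return max(a + g @ (y - x) for a, g in planes)

# tiny usage example on a made-up nonconvex function (not from the text)
f = lambda z: np.sin(z[0]) + 0.5 * z[1] ** 2
grad = lambda z: np.array([np.cos(z[0]), z[1]])   # stands in for g in the set d_[eps]f

x = np.array([1.0, 1.0]); fx = f(x)
planes = [(fx, grad(x))]                          # exactness plane m_0(., x)
y1 = np.array([0.5, 0.8])                         # a null step
planes.append(downshifted_tangent(f, y1, grad(y1), x, fx))
print(working_model(planes, x, x) <= fx)          # model does not exceed f(x) at x, cf. Prop. 26.5
```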

26.3.3 Tangent Program

Given the local model \(\Phi _{k}(\cdot,x) =\phi _{k}(\cdot,x) + \frac{1} {2}{(\cdot - x)}^{\top }Q(x)(\cdot - x)\) at serious iterate x and inner loop counter k, we solve the tangent program

$$\displaystyle\begin{array}{rcl} \min _{y\in {\mathbb{R}}^{n}}\Phi _{k}(y,x) + \frac{\tau _{k}} {2} \|y - {x\|}^{2}.& &{}\end{array}$$
(26.7)

We assume that Q(x) +τ k I ≻ 0, which means (26.7) is strictly convex and has a unique solution y k, called a trial step. The optimality condition for (26.7) implies

$$\displaystyle\begin{array}{rcl} (Q(x) +\tau _{k}I)(x - {y}^{k}) \in \partial \phi _{ k}({y}^{k},x).& &{}\end{array}$$
(26.8)

If \(\phi _{k}(\cdot,x) =\max \{ m_{\nu }(\cdot,x):\nu = 0,\ldots,k\}\), with \(m_{\nu }(\cdot,x) = a_{\nu } + g_{\nu }^{\top }(\cdot - x)\), then we can find \(\lambda _{0} \geq 0,\ldots,\lambda _{k} \geq 0\), summing up to 1, such that

$$\displaystyle{g_{k}^{{\ast}}\,:=\,(Q(x) +\tau _{ k}I)(x - {y}^{k}) =\sum _{ \nu =0}^{k}\lambda _{ \nu }g_{\nu }.}$$

Traditionally, \(g_{k}^{{\ast}}\) is called the aggregate subgradient at y k. We build the aggregate plane

$$\displaystyle{m_{k}^{{\ast}}(\cdot,x) = a_{ k}^{{\ast}} + g_{ k}^{{\ast}\top }(\cdot - x),}$$

where \(a_{k}^{{\ast}} =\sum _{ \nu =0}^{k}\lambda _{\nu }a_{\nu }\). Keeping \(m_{k}^{{\ast}}(\cdot,x)\) as an affine minorant of ϕ k+1( ⋅,x) allows us to drop some of the older cutting planes to avoid overflow. As \(\partial \phi _{k}({y}^{k},x)\) is the subdifferential of a max-function, we know that λ ν > 0 precisely for those m ν ( ⋅,x) which are active at y k. That is, \(\sum _{\nu =0}^{k}\lambda _{\nu }m_{\nu }({y}^{k},x) =\phi _{k}({y}^{k},x)\). Therefore the aggregate plane satisfies

$$\displaystyle\begin{array}{rcl} m_{k}^{{\ast}}({y}^{k},x) =\phi _{ k}({y}^{k},x).& &{}\end{array}$$
(26.9)

As our algorithm chooses ϕ k+1 such that \(m_{k}^{{\ast}}(\cdot,x) \leq \phi _{k+1}(\cdot,x)\), we have \(\phi _{k}({y}^{k},x) \leq \phi _{k+1}({y}^{k},x)\). All this follows the classical line originally proposed in Kiwiel [25]. Maintaining a model ϕ k ( ⋅,x) which contains aggregate subgradients from previous sweeps instead of all the older g ν , \(\nu = 0,\ldots,k\) does not alter the statement of Proposition 26.5 nor of formula (26.9).
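
For a polyhedral working model \(\phi _{k}(\cdot,x) =\max \{ a_{\nu } + g_{\nu }^{\top }(\cdot - x)\}\) the tangent program (26.7) can be solved through its dual, a concave quadratic program over the unit simplex, which simultaneously delivers the multipliers λ ν , the aggregate subgradient \(g_{k}^{{\ast}}\) of (26.8) and the aggregate plane. The sketch below is our own illustration under that assumption; it relies on scipy's general-purpose SLSQP solver where a production code would use a dedicated QP method.

```python
import numpy as np
from scipy.optimize import minimize

def tangent_program(a, G, Q, tau, x):
    """Solve (26.7) for phi_k(y, x) = max_nu a[nu] + G[nu] @ (y - x).
    Returns the trial step y, the aggregate subgradient g*, and its value a* at x."""
    H = Q + tau * np.eye(len(x))          # assumed positive definite (cf. step 8)
    Hinv = np.linalg.inv(H)
    m = len(a)

    # dual: maximize lam @ a - 0.5 * (G.T lam)' Hinv (G.T lam) over the simplex
    def neg_dual(lam):
        g = G.T @ lam
        return -(lam @ a - 0.5 * g @ Hinv @ g)

    cons = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}]
    bnds = [(0.0, 1.0)] * m
    lam = minimize(neg_dual, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=bnds, constraints=cons).x

    g_star = G.T @ lam                    # aggregate subgradient, cf. (26.8)
    a_star = lam @ a                      # aggregate plane value at x
    y = x - Hinv @ g_star                 # trial step: (Q + tau I)(x - y) = g*
    return y, g_star, a_star

# tiny usage example with made-up data (assumption, not from the text)
x = np.zeros(2)
a = np.array([1.0, 0.8])                  # plane values a_nu at x
G = np.array([[1.0, 0.0], [-0.5, 0.5]])   # plane gradients g_nu
Q = np.zeros((2, 2))                      # second-order model switched off
y, g_star, a_star = tangent_program(a, G, Q, tau=1.0, x=x)
print(y, g_star)
```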

26.3.4 Testing Acceptance

Having computed the kth trial step y k via (26.7), we have to decide whether it should be accepted as the new serious iterate x +. We compute the test quotient

$$\displaystyle{\rho _{k} = \frac{f(x) - f({y}^{k})} {f(x) - \Phi _{k}({y}^{k},x)}.}$$

Fixing constants 0 <γ < Γ < 1, we call y k bad if ρ k <γ and good if ρ k ≥ Γ. If y k is not bad, meaning ρ k ≥ γ, then it is accepted to become x +. We refer to this as a serious step. Here the inner loop ends. On the other hand, if y k is bad, then it is rejected and referred to as a null step. In this case the inner loop continues.
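
In code the acceptance test is a single quotient; the fragment below is our own sketch, with the constants γ and Γ chosen arbitrarily for illustration.

```python
GAMMA, GAMMA_CAP = 0.1, 0.6     # 0 < gamma < Gamma < 1, values are assumptions

def acceptance_test(f_x, f_yk, Phi_k_yk):
    """Classify the trial step y^k via the quotient rho_k of Sect. 26.3.4.
    The denominator f(x) - Phi_k(y^k, x) is positive for y^k != x, cf. (26.23)."""
    rho = (f_x - f_yk) / (f_x - Phi_k_yk)
    if rho >= GAMMA_CAP:
        return rho, "good"       # accepted; tau may be halved next time, cf. (26.12)
    if rho >= GAMMA:
        return rho, "serious"    # accepted
    return rho, "null"           # rejected; the inner loop continues
```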

26.3.5 Management of τ in the Inner Loop

The most delicate point is the management of the proximity control parameter during the inner loop. Namely, it may turn out that the trial steps y k proposed by the tangent program (26.7) are too far from the current x, so that no decrease below f(x) can be achieved. In the convex case one relies entirely on the mechanism of cutting planes. Indeed, if y k is a null step, then the convex cutting plane, when added to model ϕ k+1( ⋅,x), will cut away the unsuccessful y k, paving the way for a better y k+1 at the next sweep.

The situation is more complicated without convexity, where cutting planes are no longer tangents to f. In the case of downshifted tangents the information stored in the ideal set of all theoretically available cutting planes may not be sufficient to represent f correctly when y k is far away from x. This is when we have to force smaller steps by increasing τ, i.e., by tightening proximity control. As a means to decide when this has to happen, we use the parameter

$$\displaystyle\begin{array}{rcl} \tilde{\rho }_{k} = \frac{f(x) - M_{k}({y}^{k},x)} {f(x) - \Phi _{k}({y}^{k},x)},& &{}\end{array}$$
(26.10)

where m k ( ⋅,x) is the new cutting plane drawn for y k as in Sect. 26.3.2 and \(M_{k}(\cdot,x) = m_{k}(\cdot,x) + \frac{1} {2}{(\cdot - x)}^{\top }Q(x)(\cdot - x)\). We fix a parameter \(\tilde{\gamma }\) with \(\gamma <\tilde{\gamma }< 1\) and make the following decision.

$$\displaystyle\begin{array}{rcl} \tau _{k+1} = \left \{\begin{array}{ll} 2\tau _{k}&\mbox{ if }\rho _{k} <\gamma \mbox{ and }\tilde{\rho }_{k} \geq \tilde{\gamma }, \\ \tau _{k} &\mbox{ if }\rho _{k} <\gamma \mbox{ and }\tilde{\rho }_{k} <\tilde{\gamma }. \end{array} \right.& &{}\end{array}$$
(26.11)

The idea in (26.11) can be explained as follows. The quotient \(\tilde{\rho }_{k}\) in (26.10) can also be written as \(\tilde{\rho }_{k} = \left (f(x) - \Phi _{k+1}({y}^{k},x)\right )/\left (f(x) - \Phi _{k}({y}^{k},x)\right )\), because the cutting plane at stage k will be integrated into model Φ k+1 at stage k + 1. If \(\tilde{\rho }_{k} \approx 1\), we can therefore conclude that adding the new cutting plane at the null step y k hardly changes the situation. Put differently, had we known the cutting plane before computing y k, the result would not have been much better. In this situation we decide to force smaller trial steps by increasing the τ-parameter. If on the other hand \(\tilde{\rho }_{k} \ll 1\), then the gain of information provided by the new cutting plane at y k is substantial with regard to the information already stored in Φ k . Here we continue to add cutting planes and aggregate planes only, hoping that we will still make progress without having to increase τ. The decision \(\tilde{\rho }_{k} \approx 1\) versus \(\tilde{\rho }_{k} \ll 1\) is formalized by the rule (26.11).
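
A minimal sketch of the secondary quotient (26.10) and of the update rule (26.11); the parameter values and the calling convention (the caller supplies the values of Φ k and M k at y k) are our own assumptions.

```python
GAMMA, GAMMA_TILDE = 0.1, 0.4    # gamma < gamma_tilde < 1, values are assumptions

def tau_update_inner(tau, rho, f_x, Phi_k_yk, M_k_yk):
    """Rule (26.11): after a null step, double tau only when the secondary
    quotient rho_tilde of (26.10) says the new cutting plane changes little."""
    rho_tilde = (f_x - M_k_yk) / (f_x - Phi_k_yk)
    if rho < GAMMA and rho_tilde >= GAMMA_TILDE:
        return 2.0 * tau          # 'too bad': tighten proximity control
    return tau                    # otherwise keep tau and just enrich the model
```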

Remark 26.6.

By construction \(\tilde{\rho }_{k} \geq 0\), because aggregation assures that \(\phi _{k+1}({y}^{k},x) \geq \phi _{k}({y}^{k},x)\). Notice that in contrast ρ k may be negative. Indeed, ρ k < 0 means that the trial step y k proposed by the tangent program (26.7) gives no descent in the function values, meaning that it is clearly a bad step.

26.3.6 Management of τ in the Outer Loop

The proximity parameter τ will also be managed dynamically between serious steps xx +. In our algorithm we use a memory parameter \(\tau _{j}^{\sharp }\), which is specified at the end of the (j − 1)st inner loop and serves to initialize the jth inner loop with \(\tau _{1} =\tau _{ j}^{\sharp }\).

A first rule which we already mentioned is that we need \(Q({x}^{j}) +\tau _{k}I \succ 0\) for all k during the jth inner loop. Since τ is never decreased during the inner loop, we can assure this if we initialize \(\tau _{1} > -\lambda _{\min }(Q({x}^{j}))\).

A more important aspect is the following. Suppose the (j − 1)st inner loop ended at inner loop counter k j−1, i.e., \({x}^{j} = {y}^{k_{j-1}}\) with \(\rho _{k_{j-1}} \geq \gamma\). If acceptance was good, i.e., \(\rho _{k_{j-1}} \geq \Gamma \), then we can trust our model, and we account for this by storing a smaller parameter \(\tau _{j}^{\sharp } = \frac{1} {2}\tau _{k_{j-1}} <\tau _{k_{j-1}}\) for the jth outer loop. On the other hand, if acceptance of the (j − 1)st step was neither good nor bad, meaning \(\gamma \leq \rho _{k_{j-1}} < \Gamma \), then there is no reason to decrease τ for the next outer loop, so we memorize \(\tau _{k_{j-1}}\), the value we had at the end of the (j − 1)st inner loop. Altogether

$$\displaystyle\begin{array}{rcl} \tau _{j}^{\sharp } = \left \{\begin{array}{ll} \max \{\frac{1} {2}\tau _{k_{j-1}},-\lambda _{\min }(Q({x}^{j}))+\zeta \}&\mbox{ if }\rho _{k_{j-1}} \geq \Gamma, \\ \max \{\tau _{k_{j-1}},-\lambda _{\min }(Q({x}^{j}))+\zeta \} &\mbox{ if }\gamma \leq \rho _{k_{j-1}} < \Gamma, \end{array} \right.& &{}\end{array}$$
(26.12)

where ζ > 0 is some small threshold fixed once and for all.
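
The memory rule (26.12) translates directly into code. The following sketch is our own (the values of Γ and ζ and the eigenvalue computation via numpy are assumptions); step 8 of the algorithm additionally caps the returned value by a fixed constant T, which is not shown here.

```python
import numpy as np

GAMMA_CAP, ZETA = 0.6, 1e-3      # Gamma and the threshold zeta, assumed values

def tau_memory(tau_end, rho_end, Q_next):
    """Rule (26.12): the memory parameter tau_j^sharp that initializes the next
    inner loop; Q_next is Q(x^j) at the new serious iterate."""
    floor = -np.linalg.eigvalsh(Q_next)[0] + ZETA     # keeps Q(x^j) + tau I > 0
    if rho_end >= GAMMA_CAP:                          # good acceptance: relax tau
        return max(0.5 * tau_end, floor)
    return max(tau_end, floor)                        # merely serious: keep tau
```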

26.3.7 Recycling of Planes

In a convex bundle algorithm one keeps in principle all cutting planes in the model, using aggregation to avoid overflow. In the nonconvex case this is no longer possible. Cutting planes are downshifted tangents, which links them to the value f(x) of the current iterate x. As we pass from x to a new serious iterate x +, the cutting plane \(m_{z,g}(\cdot,x) = a + {g}^{\top }(\cdot - x)\) with \(g \in \partial _{[\varepsilon ]}f(z)\) for some z cannot be used as such, because we have no guarantee that \(a + {g}^{\top }({x}^{+} - x) \leq f({x}^{+})\). But we can downshift it again if need be. We recycle the plane as

$$\displaystyle{m_{z,g}(\cdot,{x}^{+}) = a - {s}^{+} + {g}^{\top }(\cdot - x),\quad {s}^{+} = [m_{ z,g}({x}^{+},x) - f({x}^{+})]_{ +} + c\|{x}^{+} - {z\|}^{2}.}$$

In addition one may also apply a test whether z is too far from x + to be of interest, in which case the plane should simply be removed from the stock.
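
Recycling thus amounts to re-applying the downshift with respect to the new serious iterate. A minimal sketch under our own storage convention (a plane is kept as its value at the current serious iterate, its gradient, and the point z where it was generated):

```python
import numpy as np

c = 0.1   # the same downshift constant as in (26.6)

def recycle_plane(a, g, z, x, x_plus, f_x_plus):
    """Sect. 26.3.7: re-shift the plane m(., x) = a + g^T(. - x), generated at z,
    so that it can be kept in the model at the new serious iterate x^+."""
    val_at_x_plus = a + g @ (x_plus - x)                      # m_{z,g}(x^+, x)
    s_plus = max(val_at_x_plus - f_x_plus, 0.0) + c * np.dot(x_plus - z, x_plus - z)
    # recycled plane, still written relative to x as in the text;
    # a distance test on ||x^+ - z|| may additionally be used to drop the plane
    return a - s_plus, g
```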

26.4 Algorithm

26.5 Analysis of the Inner Loop

In this section we analyze the inner loop and show that there are two possibilities. Either the inner loop terminates finitely with a step \({x}^{+} = {y}^{k}\) satisfying ρ k ≥ γ, or we get an infinite sequence of null steps y k which converges to x. In the latter case, we conclude that \(0 \in \partial _{[\tilde{\varepsilon }]}f(x)\), i.e., that x is approximate critical.

Suppose the inner loop turns forever. Then there are two possibilities. Either τ k is increased infinitely often, so that τ k → ∞, or τ k is frozen, \(\tau _{k} =\tau _{k_{0}}\) for some k 0 and all k ≥ k 0. These scenarios will be analyzed in Lemmas 26.9 and 26.11. Since the matrix Q(x) is fixed during the inner loop, we write it simply as Q.

To begin with, we need an auxiliary construction. We define the following convex function:

$$\displaystyle\begin{array}{rcl} \phi (y,x) =\sup \{ m_{z,g}(y,x): z \in B(0,M),g \in \partial _{[\varepsilon ]}f(z)\},& &{}\end{array}$$
(26.13)

where B(0,M) is a fixed ball large enough to contain x and all trial steps encountered during the inner loop. Recall that m z,g ( ⋅,x) is the cutting plane at z with approximate subgradient \(g \in \partial _{[\varepsilon ]}f(z)\) with respect to the serious iterate x. Due to boundedness of B(0,M), ϕ( ⋅,x) is defined everywhere.

Lemma 26.7.

We have ϕ(x,x) = f(x), \(\partial \phi (x,x) = \partial _{[\varepsilon ]}f(x)\) , and ϕ is jointly upper semicontinuous. Moreover, if y k ∈ B(0,M) for all k, then \(\phi _{k}(\cdot,x) \leq \phi (\cdot,x)\) for every first-order working model ϕ k .

Proof.

  1. (1)

    The first statement follows because every cutting plane drawn at some z ≠ x with \(g \in \partial _{[\varepsilon ]}f(z)\) satisfies \(m_{z,g}(x,x) \leq f(x) - c\|x - {z\|}^{2} < f(x)\), while cutting planes at x obviously have m x,g (x,x) = f(x).

  2. (2)

    Concerning the second statement, let us first prove \(\partial _{[\varepsilon ]}f(x) \subset \partial \phi (x,x)\). We consider the set of limiting subgradients

    $$\displaystyle{{\partial }^{l}f(x) =\{\lim _{ k\rightarrow \infty }\nabla f({y}^{k}): {y}^{k} \rightarrow x,\mbox{ $f$ is differentiable at ${y}^{k}$}\}.}$$

    Then \(\text{co}\,{\partial }^{l}f(x) = \partial f(x)\) by [13]. It therefore suffices to show \({\partial }^{l}f(x) +\varepsilon B \subset \partial \phi (x,x)\), because ∂ ϕ(x,x) is convex and we then have \(\partial \phi (x,x) \supset \text{co}({\partial }^{l}f(x) +\varepsilon B) = \text{co}\,{\partial }^{l}f(x) +\varepsilon B = \partial f(x) +\varepsilon B\).

    Let \(g_{a} \in {\partial }^{l}f(x) +\varepsilon B\). We have to show \(g_{a} \in \partial \phi (x,x)\). Choose \(g \in {\partial }^{l}f(x)\) such that \(\|g - g_{a}\| \leq \varepsilon\). Pick a sequence \({y}^{k} \rightarrow x\) and \(g_{k} = \nabla f({y}^{k}) \in \partial f({y}^{k})\) such that \(g_{k} \rightarrow g\). Let \(g_{a,k} = g_{k} + g_{a} - g\); then \(g_{a,k} \in \partial _{[\varepsilon ]}f({y}^{k})\) and \(g_{a,k} \rightarrow g_{a}\). Let m k ( ⋅,x) be the cutting plane drawn at y k with approximate subgradient g a,k ; then \(m_{k}({y}^{k},x) \leq \phi ({y}^{k},x)\). By the definition of the downshift process

    $$\displaystyle{m_{k}(y,x) = f({y}^{k}) + g_{ a,k}^{\top }(y - {y}^{k}) - s_{ k},}$$

    where s k is the downshift (26.6). There are two cases, \(s_{k} = c\|{y}^{k} - {x\|}^{2}\), and \(s_{k} = t_{k}(x) - f(x) + c\|{y}^{k} - {x\|}^{2}\) according to whether the term \([\ldots ]_{+}\) in (26.6) equals zero or not.

    Let us start with the second case, where t k (x) > f(x). Then \(s_{k} = f({y}^{k}) + g_{a,k}^{\top }(x - {y}^{k}) - f(x) + c\|{y}^{k} - {x\|}^{2}\) and

    $$\displaystyle\begin{array}{rcl} m_{k}(y,x)& =& f({y}^{k}) + g_{ a,k}^{\top }(y - {y}^{k}) - f({y}^{k}) - g_{ a,k}^{\top }(x - {y}^{k}) + f(x) - c\|{y}^{k} - {x\|}^{2} {}\\ & =& f(x) + g_{a,k}^{\top }(y - x) - c\|{y}^{k} - {x\|}^{2}. {}\\ \end{array}$$

    Therefore

    $$\displaystyle{\phi (y,x) -\phi (x,x) \geq m_{k}(y,x) - f(x) = g_{a,k}^{\top }(y - x) - c\|{y}^{k} - {x\|}^{2}.}$$

    Passing to the limit using \({y}^{k} \rightarrow x\) and \(g_{a,k} \rightarrow g_{a}\) proves \(g_{a} \in \partial \phi (x,x)\).

    It remains to discuss the first case, where t k (x) ≤ f(x), so that \(s_{k} = c\|{y}^{k} - {x\|}^{2}\). Then

    $$\displaystyle{m_{k}(\cdot,x) = f({y}^{k}) + g_{ a,k}^{\top }(\cdot - {y}^{k}) - c\|{y}^{k} - {x\|}^{2}.}$$

    Therefore

    $$\displaystyle\begin{array}{rcl} \phi (y,x) -\phi (x,x)& \geq & m_{k}(y,x) - f(x) {}\\ & =& f({y}^{k}) - f(x) + g_{ a,k}^{\top }(y - {y}^{k}) - c\|{y}^{k} - {x\|}^{2} {}\\ & =& f({y}^{k}) - f(x) + g_{ a,k}^{\top }(x - {y}^{k}) + g_{ a,k}^{\top }(y - x) - c\|{y}^{k} - {x\|}^{2}. {}\\ \end{array}$$

    As y is arbitrary, we have \(g_{a,k} \in \partial _{\vert \zeta _{k}\vert }\phi (x,x)\), where \(\zeta _{k} = f({y}^{k}) - f(x) + g_{a,k}^{\top }(x - {y}^{k}) - c\|{y}^{k} - {x\|}^{2}\). Since ζ k → 0, \({y}^{k} \rightarrow x\) and \(g_{a,k} \rightarrow g_{a}\), we deduce again \(g_{a} \in \partial \phi (x,x)\). Altogether for the two cases \([\ldots ]_{+} = 0\) and \([\ldots ]_{+} > 0\) we have shown \({\partial }^{l}f(x) +\varepsilon B \subset \partial \phi (x,x)\).

  3. (3)

    Let us now prove ∂ ϕ(x,x) ⊂ ∂ f(x) +ɛ B. Let \(g \in \partial \phi (x,x)\) and let \(m(\cdot,x) = f(x) + {g}^{\top }(\cdot - x)\) be the tangent plane to the graph of ϕ( ⋅,x) at x associated with g. By convexity m( ⋅,x) ≤ϕ( ⋅,x). We fix \(h \in {\mathbb{R}}^{n}\) and consider the values ϕ(x + th,x) for t > 0. According to the definition of ϕ( ⋅,x) we have \(\phi (x + th,x) = m_{z_{t},g_{t}}(x + th,x)\), where \(m_{z_{t},g_{t}}(\cdot,x)\) is a cutting plane drawn at some \(z_{t} \in B(0,M)\) with \(g_{t} \in \partial _{[\varepsilon ]}f(z_{t})\). The slope of the cutting plane along the ray \(x + \mathbb{R}_{+}h\) is \(g_{t}^{\top }h\). Now the cutting plane passes through \(\phi (x + th,x) \geq m(x + th,x)\), which means that its value at x + th is above the value of the tangent. On the other hand, according to the downshift process, the cutting plane satisfies \(m_{z_{t},g_{t}}(x,x) \leq f(x) - c\|x - z{_{t}\|}^{2}\). Its value at x is therefore below the value of m(x,x) = f(x). These two facts together tell us that \(m_{z_{t},g_{t}}(\cdot,x)\) is steeper than m( ⋅,x) along the ray \(x + \mathbb{R}_{+}h\). In other words, \({g}^{\top }h \leq g_{t}^{\top }h\). Next observe that \(\phi (x + th,x) \rightarrow \phi (x,x) = f(x)\) as t → 0+. That implies \(m_{z_{t},g_{t}}(x + th,x) \rightarrow f(x)\). Since by the definition of downshift \(m_{z_{t},g_{t}}(x + th,x) \leq f(x) - c\|x - z{_{t}\|}^{2}\), it follows that we must have \(\|x - z{_{t}\|}^{2} \rightarrow 0\), i.e., \(z_{t} \rightarrow x\) as t → 0+. Passing to a subsequence, we may assume \(g_{t} \rightarrow \hat{ g}\) for some \(\hat{g}\). With \(z_{t} \rightarrow x\) it follows from upper semicontinuity of the Clarke subdifferential that \(\hat{g} \in \partial _{[\varepsilon ]}f(x)\). On the other hand, \({g}^{\top }h \leq g_{t}^{\top }h\) for all t implies \({g}^{\top }h \leq \hat{ {g}}^{\top }h\). Therefore \({g}^{\top }h \leq \sigma _{K}(h) =\max \{\tilde{ {g}}^{\top }h:\tilde{ g} \in K\}\), where σ K is the support function of \(K = \partial _{[\varepsilon ]}f(x)\). Given that h was arbitrary, and as K is closed convex, this implies \(g \in K\) by Hahn–Banach.

  4. (4)

    Upper semicontinuity of ϕ follows from upper semicontinuity of the Clarke subdifferential. Indeed, let \(x_{j} \rightarrow x\), \(y_{j} \rightarrow y\). Using the definition (26.13) of ϕ, find cutting planes \(m_{z_{j},g_{j}}(\cdot,x_{j}) = t_{z_{j}}(\cdot ) - s_{j}\) at serious iterate x j , drawn at z j with \(g_{j} \in \partial _{[\varepsilon ]}f(z_{j})\), such that \(\phi (y_{j},x_{j}) \leq m_{z_{j},g_{j}}(y_{j},x_{j}) +\varepsilon _{j}\) and ɛ j → 0. We have \(t_{z_{j}}(y) = f(z_{j}) + g_{j}^{\top }(y - z_{j})\). Passing to a subsequence, we may assume \(z_{j} \rightarrow z\) and \(g_{j} \rightarrow g \in \partial _{[\varepsilon ]}f(z)\). That means \(t_{z_{j}}(\cdot ) \rightarrow t_{z}(\cdot )\), and since \(y_{j} \rightarrow y\) also \(t_{z_{j}}(y_{j}) \rightarrow t_{z}(y)\). In order to conclude for the \(m_{z_{j},g_{j}}(\cdot,x_{j})\) we have to see how the downshift behaves. We have indeed \(s_{j} \rightarrow s\), where s is the downshift at z with respect to the approximate subgradient g and serious iterate x. Therefore \(m_{z,g}(\cdot,x) = t_{z}(\cdot ) - s\). This shows \(m_{z_{j},g_{j}}(\cdot,x_{j}) = t_{z_{j}}(\cdot ) - s_{j} \rightarrow t_{z}(\cdot ) - s = m_{z,g}(\cdot,x)\) as j → ∞, and then also \(m_{z_{j},g_{j}}(y_{j},x_{j}) = t_{z_{j}}(y_{j}) - s_{j} \rightarrow t_{z}(y) - s = m_{z,g}(y,x)\), where uniformity comes from boundedness of the g j . This implies \(\lim m_{z_{j},g_{j}}(y_{j},x_{j}) = m_{z,g}(y,x) \leq \phi (y,x)\) as required.

  5. (5)

    The inequality ϕ k ≤ ϕ is clear, because ϕ k ( ⋅,x) is built from cutting planes m k ( ⋅,x), and all these cutting planes are below the envelope ϕ( ⋅,x).

Remark 26.8.

In [40, 42] the case ɛ = 0 is discussed and a function ϕ( ⋅,x) with the properties in Lemma 26.7 is called a first-order model of f at x. It can be understood as a generalized first-order Taylor expansion of f at x. Every locally Lipschitz function f has the standard or Clarke model \({\phi }^{\sharp }(y,x) = f(x) + {f}^{0}(x,y - x)\), where f 0(x,d) is the Clarke directional derivative at x. In the present situation it is reasonable to call ϕ( ⋅,x) an ɛ-model of f at x.

Following [34] a function f is called ɛ-convex on an open convex set U if \(f(tx + (1 - t)y) \leq tf(x) + (1 - t)f(y) +\varepsilon t(1 - t)\|x - y\|\) for all \(x,y \in U\) and 0 ≤ t ≤ 1. Every ɛ-convex function satisfies \(f{^\prime}(y,x - y) \leq f(x) - f(y) +\varepsilon \| x - y\|\); hence for \(g \in \partial f(y)\),

$$\displaystyle\begin{array}{rcl}{ g}^{\top }(x - y) \leq f(x) - f(y) +\varepsilon \| x - y\|.& &{}\end{array}$$
(26.14)

A function f is called approximate convex if for every x and ɛ > 0 there exists δ > 0 such that f is ɛ-convex on B(x,δ). Using results from [14, 34] one may show that approximate convex functions coincide with lower C 1 functions in the sense of Spingarn [51].

Lemma 26.9.

Suppose the inner loop turns forever and τ k →∞.

  1. 1.

    If f is ɛ′-convex on a set containing all y k , k ≥ k 0 , then \(0 \in \partial _{[\tilde{\varepsilon }]}f(x)\) , where \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\) .

  2. 2.

    If f is lower C 1 , then 0 ∈ ∂ [αɛ] f(x), where \(\alpha = 1 + {(\tilde{\gamma }-\gamma )}^{-1}\) .

Proof.

 

  1. (i)

    The second statement follows from the first, because every lower C 1 function is approximate convex, hence ɛ′-convex on a suitable neighborhood of x. We therefore concentrate on the first statement.

  2. (ii)

    By assumption none of the trial steps is accepted, so that ρ k <γ for all \(k \in \mathbb{N}\). Since τ k is increased infinitely often, there are infinitely many inner loop instances k where \(\tilde{\rho }_{k} \geq \tilde{\gamma }\). Let us prove that under these circumstances \({y}^{k} \rightarrow x\). Recall that \(g_{k}^{{\ast}} = (Q +\tau _{k}I)(x - {y}^{k}) \in \partial \phi _{k}({y}^{k},x)\). By the subgradient inequality this gives

    $$\displaystyle\begin{array}{rcl} g_{k}^{{\ast}\top }(x - {y}^{k}) \leq \phi _{ k}(x,x) -\phi _{k}({y}^{k},x).& & {}\end{array}$$
    (26.15)

    Now use ϕ k (x,x) = f(x) and observe that \(m_{0}({y}^{k},x) \leq \phi _{k}({y}^{k},x)\), where m 0( ⋅,x) is the exactness plane. Since \(m_{0}(y,x) = f(x) + g{(x)}^{\top }(y - x)\) for some \(g(x) \in \partial _{[\varepsilon ]}f(x)\), expanding the term on the left of (26.15) gives

    $$\displaystyle\begin{array}{rcl}{ (x - {y}^{k})}^{\top }(Q +\tau _{ k}I)(x - {y}^{k}) \leq g{(x)}^{\top }(x - {y}^{k}) \leq \| g(x)\|\|x - {y}^{k}\|.& & {}\end{array}$$
    (26.16)

    Since τ k → ∞, the term on the left-hand side of (26.16) behaves asymptotically like \(\tau _{k}\|x - {y{}^{k}\|}^{2}\). Dividing (26.16) by \(\|x - {y}^{k}\|\) therefore shows that \(\tau _{k}\|x - {y}^{k}\|\) is bounded by \(\|g(x)\|\). As τ k → ∞, this could only mean \({y}^{k} \rightarrow x\).

  3. (iii)

    Let us use \({y}^{k} \rightarrow x\) and go back to formula (26.15). Since the left-hand side of (26.15) tends to 0 and ϕ k (x,x) = f(x), we see that the limit superior of \(\phi _{k}({y}^{k},x)\) is at most f(x). On the other hand, \(\phi _{k}({y}^{k},x) \geq m_{0}({y}^{k},x)\), where m 0( ⋅,x) is the exactness plane. Since clearly \(m_{0}({y}^{k},x) \rightarrow m_{0}(x,x) = f(x)\), the limit inferior is at least f(x), and we conclude that \(\phi _{k}({y}^{k},x) \rightarrow f(x)\).

    Keeping this in mind, let us use the subgradient inequality (26.15) again and subtract a term \(\frac{1} {2}{(x - {y}^{k})}^{\top }Q(x - {y}^{k})\) from both sides. That gives the estimate

    $$\displaystyle{\frac{1} {2}{(x - {y}^{k})}^{\top }Q(x - {y}^{k}) +\tau _{ k}\|x - {y{}^{k}\|}^{2} \leq f(x) - \Phi _{ k}({y}^{k},x).}$$

    Fix 0 <ζ < 1. Using τ k → ∞ we have

    $$\displaystyle{(1-\zeta )\tau _{k}\|x - {y}^{k}\| \leq \| g_{ k}^{{\ast}}\|\leq (1+\zeta )\tau _{ k}\|x - {y}^{k}\|}$$

    and also

    $$\displaystyle{\frac{1} {2}{(x - {y}^{k})}^{\top }Q(x - {y}^{k}) +\tau _{ k}\|x - {y{}^{k}\|}^{2} \geq (1-\zeta )\tau _{ k}\|x - {y{}^{k}\|}^{2}}$$

    for sufficiently large k. Therefore,

    $$\displaystyle\begin{array}{rcl} f(x) - \Phi _{k}({y}^{k},x) \geq \frac{1-\zeta } {1+\zeta }\|g_{k}^{{\ast}}\|\|x - {y}^{k}\|& & {}\end{array}$$
    (26.17)

    for k large enough.

  4. (iv)

    Now let \(\eta _{k}\,:=\,\text{dist}\left (g_{k}^{{\ast}},\partial \phi (x,x)\right )\). We argue that η k → 0. Indeed, using the subgradient inequality at y k in tandem with \(\phi (\cdot,x) \geq \phi _{k}(\cdot,x)\), we have for all \(y \in {\mathbb{R}}^{n}\)

    $$\displaystyle{\phi (y,x) \geq \phi _{k}({y}^{k},x) +{ g_{ k}^{{\ast}}}^{\top }(y - {y}^{k}).}$$

    Here our upper envelope function (26.13) is defined such that the ball B(0,M) contains x and all trial points y k at which cutting planes are drawn.

    Since the subgradients \(g_{k}^{{\ast}}\) are bounded by part (ii), there exists an infinite subsequence \(\mathcal{N} \subset \mathbb{N}\) such that \(g_{k}^{{\ast}}\rightarrow {g}^{{\ast}}\), \(k \in \mathcal{N}\), for some g . Passing to the limit \(k \in \mathcal{N}\) and using y kx and \(\phi _{k}({y}^{k},x) \rightarrow f(x) =\phi (x,x)\), we have \(\phi (y,x) \geq \phi (x,x) + {g}^{{\ast}\top }(y - x)\) for all y. Hence g ∂ ϕ(x,x), which means \(\eta _{k} = \text{dist}(g_{k}^{{\ast}},\partial \phi (x,x)) \leq \| g_{k}^{{\ast}}- {g}^{{\ast}}\|\rightarrow 0\), \(k \in \mathcal{N}\), proving the argument.

  5. (v)

    Using the definition of η k , choose \(\tilde{g}_{k} \in \partial \phi (x,x)\) such that \(\|g_{k}^{{\ast}}-\tilde{ g}_{k}\| =\eta _{k}\). Now let dist\(\left (0,\partial \phi (x,x)\right ) =\eta\). Then \(\|\tilde{g}_{k}\| \geq \eta\) for all \(k \in \mathcal{N}\). Hence \(\|g_{k}^{{\ast}}\|\geq \eta -\eta _{k} > (1-\zeta )\eta\) for \(k \in \mathcal{N}\) large enough, given that η k → 0 by (iv). Going back with this to (26.17) we deduce

    $$\displaystyle\begin{array}{rcl} f(x) - \Phi _{k}({y}^{k},x) \geq \frac{{(1-\zeta )}^{2}} {1+\zeta } \eta \|x - {y}^{k}\|& & {}\end{array}$$
    (26.18)

    for \(k \in \mathcal{N}\) large enough.

  6. (vi)

    We claim that \(f({y}^{k}) \leq M_{k}({y}^{k},x) + (1+\zeta )(\varepsilon {^\prime}+\varepsilon )\|x - {y}^{k}\|\) for all k sufficiently large. Indeed, we have \(m_{k}(\cdot,x) = t_{k}(\cdot ) - s_{k}\), where s k is the downshift of the approximate tangent t k ( ⋅) at y k, \(g_{\varepsilon k} \in \partial _{[\varepsilon ]}f({y}^{k})\), with regard to the serious iterate x. There are two cases. Assume first that t k (x) > f(x). Then

    $$\displaystyle\begin{array}{rcl} m_{k}(y,x)& =& f({y}^{k}) + g_{\varepsilon k}^{\top }(y - {y}^{k}) - s_{ k} {}\\ & =& f({y}^{k}) + g_{\varepsilon k}^{\top }(y - {y}^{k}) - c\|x - {y{}^{k}\|}^{2} - t_{ k}(x) + f(x) {}\\ & =& f(x) + g_{\varepsilon k}^{\top }(y - x) - c\|x - {y{}^{k}\|}^{2}. {}\\ \end{array}$$

    In consequence

    $$\displaystyle\begin{array}{rcl} f({y}^{k})-m_{ k}({y}^{k},x)& =& f({y}^{k})-f(x)-g_{\varepsilon k}^{\top }({y}^{k}-x)+c\|x-{y{}^{k}\|}^{2} {}\\ & =& f({y}^{k})-f(x)-g_{ k}^{\top }({y}^{k}-x)+{(g_{ k}-g_{\varepsilon k})}^{\top }(x-{y}^{k})+c\|x-{y{}^{k}\|}^{2}. {}\\ \end{array}$$

    Now since f is ɛ′-convex, estimate (26.14) is valid in the form

    $$\displaystyle{g_{k}^{\top }(x - {y}^{k}) \leq f(x) - f({y}^{k}) +\varepsilon {^\prime}\|x - {y}^{k}\|.}$$

    We therefore get

    $$\displaystyle{f({y}^{k}) - m_{ k}({y}^{k},x) \leq (\varepsilon {^\prime}+\varepsilon )\|x - {y}^{k}\| + c\|x - {y{}^{k}\|}^{2}.}$$

    Subtracting a term \(\frac{1} {2}{(x - {y}^{k})}^{\top }Q(x - {y}^{k})\) on both sides gives

    $$\displaystyle{f({y}^{k}) - M_{ k}({y}^{k},x) \leq (\varepsilon {^\prime} +\varepsilon +\nu _{ k})\|x - {y}^{k}\|,}$$

    where \(\nu _{k}:= c\|x - {y{}^{k}\|}^{2} -\frac{1} {2}{(x - {y}^{k})}^{\top }Q(x - {y}^{k}) \rightarrow 0\) and \(M_{k}(y,x) = m_{k}(y,x) + \frac{1} {2}{(y - x)}^{\top }Q(y - x)\). Therefore

    $$\displaystyle\begin{array}{rcl} f({y}^{k}) - M_{ k}({y}^{k},x) \leq (1+\zeta )(\varepsilon {^\prime}+\varepsilon )\|x - {y}^{k}\|& & {}\end{array}$$
    (26.19)

    for k large enough.

    Now consider the second case t k (x) ≤ f(x). Here we get an even better estimate than (26.19), because \(s_{k} = c\|x - {y{}^{k}\|}^{2}\), so that \(f({y}^{k}) - m_{k}({y}^{k},x) = c\|x - {y{}^{k}\|}^{2} \leq \varepsilon \| x - {y}^{k}\|\) for k large enough.

  7. (vii)

    To conclude, using (26.18) and (26.19) we expand the coefficient \(\tilde{\rho }_{k}\) as

    $$\displaystyle\begin{array}{rcl} \tilde{\rho }_{k}& =& \rho _{k} + \frac{f({y}^{k}) - M_{k}({y}^{k},x)} {f(x) - \Phi _{k}({y}^{k},x)} {}\\ & \leq &\rho _{k} + \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )\|x - {y}^{k}\|} {{(1-\zeta )}^{2}\eta \|x - {y}^{k}\|} =\rho _{k} + \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}\eta }. {}\\ \end{array}$$

    This shows

    $$\displaystyle{\eta < \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}(\tilde{\gamma }-\gamma )}.}$$

    For suppose we had \(\eta \geq \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}(\tilde{\gamma }-\gamma )}\), then \(\tilde{\rho }_{k} \leq \rho _{k} + (\tilde{\gamma }-\gamma ) <\tilde{\gamma }\) for all k, contradicting \(\tilde{\rho }_{k} \geq \tilde{\gamma }\) for infinitely many k. As 0 <ζ < 1 was arbitrary, we have the estimate \(\eta \leq \frac{\varepsilon {^\prime}+\varepsilon } {\tilde{\gamma }-\gamma }\). Since \(\partial \phi (x,x) = \partial f(x) +\varepsilon B\) by Lemma 26.7, we deduce \(0 \in \partial \phi (x,x) +\eta B \subset \partial f(x) + (\varepsilon +\eta )B\), and this is the result claimed in statement 1.

Remark 26.10.

Suppose we choose γ very small and \(\tilde{\gamma }\) close to 1, then \(\alpha = 2+\xi\) for some small ξ, so roughly α ≈ 2.

Lemma 26.11.

Suppose the inner loop turns forever and τ k is frozen. Then y k → x and 0 ∈ ∂ [ɛ] f(x).

Proof.

 

  1. (i)

    The control parameter is frozen from counter k 0 onwards, and we put τ := τ k , k ≥ k 0. This means that ρ k <γ and \(\tilde{\rho }_{k} <\tilde{\gamma }\) for all k ≥ k 0.

  2. (ii)

    We prove that the sequence of trial steps y k is bounded. Notice that

    $$\displaystyle{g_{k}^{{\ast}\top }(x - {y}^{k}) \leq \phi _{ k}(x,x) -\phi _{k}({y}^{k},x)}$$

    by the subgradient inequality at y k and the definition of the aggregate subgradient. Now observe that ϕ k (x,x) = f(x) and \(\phi _{k}({y}^{k},x) \geq m_{0}({y}^{k},x)\). Therefore, using the definition of \(g_{k}^{{\ast}}\), we have

    $$\displaystyle{{(x - {y}^{k})}^{\top }(Q +\tau I)(x - {y}^{k}) \leq f(x) - m_{ 0}({y}^{k},x) = g{(x)}^{\top }(x - {y}^{k}) \leq \| g(x)\|\|x - {y}^{k}\|.}$$

    Since the τ-parameter is frozen and Q +τ I ≻ 0, the expression on the left is the square \(\|x - {y}^{k}\|_{Q+\tau I}^{2}\) of the Euclidean norm derived from Q +τ I. Since both norms are equivalent, we deduce after dividing by \(\|x - {y}^{k}\|\) that \(\|x - {y}^{k}\|_{Q+\tau I} \leq C\|g(x)\|\) for some constant C > 0 and all k. This proves the claim.

  3. (iii)

    Let us introduce the objective function of tangent program (26.7) for k ≥ k 0:

    $$\displaystyle{\psi _{k}(\cdot,x) =\phi _{k}(\cdot,x) + \frac{1} {2}{(\cdot - x)}^{\top }(Q +\tau I)(\cdot - x).}$$

    Let \(m_{k}^{{\ast}}(\cdot,x)\) be the aggregate plane, then \(\phi _{k}({y}^{k},x) = m_{k}^{{\ast}}({y}^{k},x)\) by (26.9) and therefore also

    $$\displaystyle{\psi _{k}({y}^{k},x) = m_{ k}^{{\ast}}({y}^{k},x) + \frac{1} {2}{({y}^{k} - x)}^{\top }(Q +\tau I)({y}^{k} - x).}$$

    We introduce the quadratic function \(\psi _{k}^{{\ast}}(\cdot,x) = m_{k}^{{\ast}}(\cdot,x) + \frac{1} {2}{(\cdot - x)}^{\top }(Q +\tau I)(\cdot - x).\) Then

    $$\displaystyle\begin{array}{rcl} \psi _{k}({y}^{k},x) =\psi _{ k}^{{\ast}}({y}^{k},x)& & {}\end{array}$$
    (26.20)

    by what we have just seen. By construction of model ϕ k+1( ⋅,x) we have \(m_{k}^{{\ast}}(y,x) \leq \phi _{k+1}(y,x)\), so that

    $$\displaystyle\begin{array}{rcl} \psi _{k}^{{\ast}}(y,x) \leq \psi _{ k+1}(y,x).& & {}\end{array}$$
    (26.21)

    Notice that \(\nabla \psi _{k}^{{\ast}}(y,x) = \nabla m_{k}^{{\ast}}(y,x) + (Q +\tau I)(y - x) = g_{k}^{{\ast}} + (Q +\tau I)(y - x)\), so that \(\nabla \psi _{k}^{{\ast}}({y}^{k},x) = 0\) by (26.8). We therefore have the relation

    $$\displaystyle\begin{array}{rcl} \psi _{k}^{{\ast}}(y,x) =\psi _{ k}^{{\ast}}({y}^{k},x) + \frac{1} {2}{(y - {y}^{k})}^{\top }(Q +\tau I)(y - {y}^{k}),& & {}\end{array}$$
    (26.22)

    which is obtained by Taylor expansion of \(\psi _{k}^{{\ast}}(\cdot,x)\) at y k. Recall that step 8 of the algorithm assures Q +τ I ≻ 0, so that the quadratic expression defines the Euclidean norm \(\|\cdot \|_{Q+\tau I}\).

  4. (iv)

    From the previous point (iii) we now have

    $$\displaystyle{ \begin{array}{rclll} \psi _{k}({y}^{k},x)& \leq &\psi _{k}^{{\ast}}({y}^{k},x) + \frac{1} {2}\|{y}^{k} - {y}^{k+1}\|_{ Q+\tau I}^{2} & & \mbox{ [using (26.20)]} \\ & =&\psi _{k}^{{\ast}}({y}^{k+1},x) &&\mbox{ [using (26.22)] } \\ & \leq &\psi _{k+1}({y}^{k+1},x) &&\mbox{ [using (26.21)]} \\ & \leq &\psi _{k+1}(x,x) &&\mbox{ (${y}^{k+1}$ minimizer of $\psi _{k+1}$)} \\ & =&\phi _{k+1}(x,x) = f(x).\end{array} }$$
    (26.23)

    We deduce that the sequence \(\psi _{k}({y}^{k},x)\) is monotonically increasing and bounded above by f(x). It therefore converges to some value \({\psi }^{{\ast}}\leq f(x)\).

    Going back to (26.23) with this information shows that the term \(\frac{1} {2}\|{y}^{k} - {y}^{k+1}\|_{ Q+\tau I}^{2}\) is squeezed in between two convergent terms with the same limit \({\psi }^{{\ast}}\), which implies \(\frac{1} {2}\|{y}^{k} - {y}^{k+1}\|_{ Q+\tau I}^{2} \rightarrow 0\). Consequently, \(\|{y}^{k} - x\|_{Q+\tau I}^{2} -\| {y}^{k+1} - x\|_{Q+\tau I}^{2}\) also tends to 0, because the sequence of trial steps y k is bounded by part (ii).

    Recalling \(\phi _{k}(y,x) =\psi _{k}(y,x) -\frac{1} {2}\|y - x\|_{Q+\tau I}^{2}\), we deduce, using both convergence results, that

    $$\displaystyle\begin{array}{rcl} & & \phi _{k+1}({y}^{k+1},x) -\phi _{ k}({y}^{k},x) \\ & & \quad =\psi _{k+1}({y}^{k+1},x) -\psi _{ k}({y}^{k},x) -\frac{1} {2}\|{y}^{k+1} - x\|_{ Q+\tau I}^{2} + \frac{1} {2}\|{y}^{k} - x\|_{ Q+\tau I}^{2} \rightarrow 0. {}\end{array}$$
    (26.24)
  5. (v)

    We want to show that \(\phi _{k}({y}^{k},x) -\phi _{k+1}({y}^{k},x) \rightarrow 0\) and then of course also \(\Phi _{k}({y}^{k},x) - \Phi _{k+1}({y}^{k},x) \rightarrow 0\).

    Recall that by construction the cutting plane m k ( ⋅,x) is an affine support function of ϕ k+1( ⋅,x) at y k. By the subgradient inequality this implies

    $$\displaystyle\begin{array}{rcl} g_{k}^{\top }(y - {y}^{k}) \leq \phi _{ k+1}(y,x) -\phi _{k+1}({y}^{k},x)& & {}\end{array}$$
    (26.25)

    for all y. Therefore

    $$\displaystyle\begin{array}{rcl} 0& \leq &\phi _{k+1}({y}^{k},x) -\phi _{ k}({y}^{k},x)\quad \qquad \qquad \qquad \qquad \qquad \mbox{ (using aggregation)} {}\\ & =& \phi _{k+1}({y}^{k},x) + g_{ k}^{\top }({y}^{k+1} - {y}^{k}) -\phi _{ k}({y}^{k},x) - g_{ k}^{\top }({y}^{k+1} - {y}^{k}) {}\\ & \leq &\phi _{k+1}({y}^{k+1},x) -\phi _{ k}({y}^{k},x) +\| g_{ k}\|\|{y}^{k+1} - {y}^{k}\|\qquad \mbox{ [using (26.25)]} {}\\ \end{array}$$

    and this term converges to 0, because of (26.24), because the g k are bounded, and because \({y}^{k} - {y}^{k+1} \rightarrow 0\) according to part (iv) above. Boundedness of the g k follows from boundedness of the trial steps y k shown in part (ii). Indeed, \(g_{k} \in \partial f({y}^{k}) +\varepsilon B\), and the subdifferential of f is uniformly bounded on the bounded set \(\{{y}^{k}: k \in \mathbb{N}\}\). We deduce that \(\phi _{k+1}({y}^{k},x) -\phi _{k}({y}^{k},x) \rightarrow 0\). Obviously, that also gives \(\Phi _{k+1}({y}^{k},x) - \Phi _{k}({y}^{k},x) \rightarrow 0\).

  6. (vi)

    We now proceed to prove \(\Phi _{k}({y}^{k},x) \rightarrow f(x)\) and then also \(\Phi _{k+1}({y}^{k},x) \rightarrow f(x)\). Assume this is not the case, then \(\limsup _{k\rightarrow \infty }f(x) - \Phi _{k}({y}^{k},x) =:\eta > 0\). Choose δ > 0 such that \(\delta < (1-\tilde{\gamma })\eta\). It follows from (v) above that there exists \(k_{1} \geq k_{0}\) such that

    $$\displaystyle\begin{array}{rcl} \Phi _{k+1}({y}^{k},x)-\delta \leq \Phi _{ k}({y}^{k},x)& & {}\\ \end{array}$$

    for all kk 1. Using \(\tilde{\rho }_{k} \leq \tilde{\gamma }\) for all kk 0 then gives

    $$\displaystyle\begin{array}{rcl} \tilde{\gamma }\left (\Phi _{k}({y}^{k},x) - f(x)\right )& \leq & \Phi _{ k+1}({y}^{k},x) - f(x) \leq \Phi _{ k}({y}^{k},x) +\delta -f(x). {}\\ \end{array}$$

    Passing to the limit implies \(-\tilde{\gamma }\eta \leq -\eta +\delta\), contradicting the choice of δ. This proves η = 0.

  7. (vii)

    Having shown \(\Phi _{k}({y}^{k},x) \rightarrow f(x)\) and therefore also \(\Phi _{k+1}({y}^{k},x) \rightarrow f(x)\), we now argue that \({y}^{k} \rightarrow x\). This follows from the definition of ψ k , because

    $$\displaystyle{\Phi _{k}({y}^{k},x) \leq \psi _{ k}({y}^{k},x) = \Phi _{ k}({y}^{k},x) + \frac{\tau } {2}\|{y}^{k} - {x\|}^{2} {\leq \psi }^{{\ast}}\leq f(x).}$$

    Since \(\Phi _{k}({y}^{k},x) \rightarrow f(x)\) by part (vi), we deduce \(\frac{\tau }{2}\|{y}^{k} - {x\|}^{2} \rightarrow 0\) using a sandwich argument, which also proves en passant that \({\psi }^{{\ast}} = f(x)\) and \(\phi _{k}({y}^{k},x) \rightarrow f(x)\).

    To finish the proof, let us now show \(0 \in \partial _{[\varepsilon ]}f(x)\). Remember that by the necessary optimality condition for (26.7) we have \((Q +\tau I)(x - {y}^{k}) \in \partial \phi _{k}({y}^{k},x).\) By the subgradient inequality,

    $$\displaystyle\begin{array}{rcl}{ (x - {y}^{k})}^{\top }(Q +\tau I)(y - {y}^{k})& \leq &\phi _{ k}(y,x) -\phi _{k}({y}^{k},x) {}\\ & \leq &\phi (y,x) -\phi _{k}({y}^{k},x), {}\\ \end{array}$$

    where ϕ is the upper envelope (26.13) of all cutting planes drawn at \(z \in B(0,M)\), \(g \in \partial _{[\varepsilon ]}f(z)\), which we choose large enough to contain the bounded set \(\{x\} \cup \{ {y}^{k}: k \in \mathbb{N}\}\), a fact which assures ϕ k ( ⋅,x) ≤ϕ( ⋅,x) for all k (see Lemma 26.7). Passing to the limit, observing \(\|x - {y}^{k}\|_{Q+\tau I}^{2} \rightarrow 0\) and \(\phi _{k}({y}^{k},x) \rightarrow f(x) =\phi (x,x)\), we obtain

    $$\displaystyle{0 \leq \phi (y,x) -\phi (x,x)}$$

    for all y. This proves 0 ∈ ∂ ϕ(x,x). Since \(\partial \phi (x,x) \subset \partial _{[\varepsilon ]}f(x)\) by Lemma 26.7, we have shown \(0 \in \partial _{[\varepsilon ]}f(x)\).

26.6 Convergence of the Outer Loop

In this section we prove subsequence convergence of our algorithm for the case where function values are exact and subgradients are in \(\partial _{[\varepsilon ]}f({y}^{k})\). We write \(Q_{j} = Q({x}^{j})\) for the matrix of the second-order model, which depends on the serious iterates x j.

Theorem 26.12.

Let x 1 be such that \(\Omega =\{ x \in {\mathbb{R}}^{n}: f(x) \leq f({x}^{1})\}\) is bounded. Suppose f is ɛ′-convex on Ω and that subgradients are drawn from ∂ [ɛ] f(y), whereas function values are exact. Then every accumulation point \(\bar{x}\) of the sequence of serious iterates x j satisfies \(0 \in \partial _{[\tilde{\varepsilon }]}f(\bar{x})\) , where \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\) .

Proof.

 

  1. (i)

    From the analysis in Sect. 26.5 we know that if we apply the stopping test in step 2 with \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\), then the inner loop ends after a finite number of steps k with a new x + satisfying the acceptance test in step 5, unless we have finite termination due to \(0 \in \partial _{[\tilde{\varepsilon }]}f(x)\). Let us exclude this case, and let x j denote the infinite sequence of serious iterates. We assume that at outer loop counter j the inner loop finds a serious step at inner loop counter k = k j . In other words, \({y}^{k_{j}} = {x}^{j+1}\) passes the acceptance test in step 5 of the algorithm and becomes a serious iterate, while the y k with k < k j are null steps. That means

    $$\displaystyle\begin{array}{rcl} f({x}^{j}) - f({x}^{j+1}) \geq \gamma \left (f({x}^{j}) - \Phi _{ k_{j}}({x}^{j+1},{x}^{j})\right ).& & {}\end{array}$$
    (26.26)

    Now recall that \((Q_{j} +\tau _{k_{j}}I)({x}^{j} - {x}^{j+1}) \in \partial \phi _{k_{j}}({x}^{j+1},{x}^{j})\) by optimality of the tangent program (26.7). The subgradient inequality for \(\phi _{k_{j}}(\cdot,{x}^{j})\) at x j+1 therefore gives

    $$\displaystyle\begin{array}{rcl}{ \left ({x}^{j} - {x}^{j+1}\right )}^{\top }(Q_{ j} +\tau _{k_{j}}I)({x}^{j} - {x}^{j+1})& \leq &\phi _{ k_{j}}({x}^{j},{x}^{j}) -\phi _{ k_{j}}({x}^{j+1},{x}^{j}) {}\\ & =& f({x}^{j}) -\phi _{ k_{j}}({x}^{j+1},{x}^{j}), {}\\ \end{array}$$

    using \(\phi _{k_{j}}({x}^{j},{x}^{j}) = f({x}^{j})\). With \(\Phi _{k}(y,{x}^{j}) =\phi _{k}(y,{x}^{j}) + \frac{1} {2}{(y - {x}^{j})}^{\top }Q_{ j}(y - {x}^{j})\) we have

    $$\displaystyle\begin{array}{rcl} \frac{1} {2}\|{x}^{j+1} - {x}^{j}\|_{ Q_{j}+\tau _{k_{j}}I}^{2} \leq f({x}^{j}) - \Phi _{ k_{j}}({x}^{j+1},{x}^{j}) {\leq \gamma }^{-1}\left (f({x}^{j}) - f({x}^{j+1})\right ),& & {}\end{array}$$
    (26.27)

    using (26.26). Summing (26.27) from j = 1 to j = J gives

    $$\displaystyle{\sum _{j=1}^{J}\frac{1} {2}\|{x}^{j+1} - {x}^{j}\|_{ Q_{j}+\tau _{k_{j}}I}^{2} {\leq \gamma }^{-1}\sum _{ j=1}^{J}\left (f({x}^{j}) - f({x}^{j+1})\right ) {=\gamma }^{-1}\left (f({x}^{1}) - f({x}^{J+1})\right ).}$$

    Here the right-hand side is bounded above because our method is of descent type in the serious steps and Ω is bounded. Consequently the series on the left is summable, and therefore \(\|{x}^{j+1} - {x}^{j}\|_{Q_{j}+\tau _{k_{ j}}I}^{2} \rightarrow 0\) as j → ∞. Let \(\bar{x}\) be an accumulation point of the sequence x j. We have to prove \(0 \in \partial _{[\tilde{\varepsilon }]}f(\bar{x})\). We select a subsequence \(j \in J\) such that \({x}^{j} \rightarrow \bar{ x}\), \(j \in J\). There are now two cases. The first is discussed in part (ii); the second is more complicated and will be discussed in (iii)–(ix).

  2. (ii)

    Suppose there exists an infinite subsequence J′ of J such that \(g_{j}\,:=\,(Q_{j} +\tau _{k_{j}}I)\left ({x}^{j} - {x}^{j+1}\right )\) converges to 0, \(j \in J{^\prime}\). We will show that in this case \(0 \in \partial _{[\tilde{\varepsilon }]}f(\bar{x})\).

    In order to prove this claim, notice first that since \(\Omega =\{ x \in {\mathbb{R}}^{n}: f(x) \leq f({x}^{1})\}\) is bounded by hypothesis, and since our algorithm is of descent type in the serious steps, the sequence x j, \(j \in \mathbb{N}\) is bounded. We can therefore use the convex upper envelope function ϕ of (26.13), where B(0,M) contains Ω and also all the trial points y k visited during all inner loops j.

    Indeed, the set of x j being bounded, so are the \(\|g({x}^{j})\|\), where \(g({x}^{j}) \in \partial _{[\varepsilon ]}f({x}^{j})\) is the exactness subgradient of the jth inner loop. From (26.16) we know that \(\|{x}^{j} - {y}^{k}\|_{Q_{j}+\tau _{k}I} \leq \| g({x}^{j})\|\) for every j and every trial step y k arising in the jth inner loop at some instant k. From the management of the τ-parameter in the outer loop (26.12) we know that \(Q_{j} +\tau _{k}I \succ \zeta I\) for some ζ > 0, so \(\|{x}^{j} - {y}^{k}\| {\leq \zeta }^{-1}\|g({x}^{j})\| \leq C < \infty \), meaning the y k are bounded. During the following the properties of ϕ obtained in Lemma 26.7 will be applied at every x = x j.

    Since g j is a subgradient of \(\phi _{k_{j}}(\cdot,{x}^{j})\) at \({x}^{j+1} = {y}^{k_{j}}\), we have for every test vector h

    $$\displaystyle\begin{array}{rcl} g_{j}^{\top }h& \leq &\phi _{ k_{j}}({x}^{j+1} + h,{x}^{j}) -\phi _{ k_{j}}({x}^{j+1},{x}^{j}) {}\\ & \leq &\phi ({x}^{j+1} + h,{x}^{j}) -\phi _{ k_{j}}({x}^{j+1},{x}^{j})\qquad \mbox{ [using $\phi _{ k_{j}}(\cdot,{x}^{j}) \leq \phi (\cdot,{x}^{j})$].} {}\\ \end{array}$$

    Now \({y}^{k_{j}} = {x}^{j+1}\) was accepted in step 5 of the algorithm, which means

    $${\displaystyle{\gamma }^{-1}\left (f({x}^{j}) - f({x}^{j+1})\right ) \geq f({x}^{j}) - \Phi _{ k_{j}}({x}^{j+1},{x}^{j}).}$$

    Combining these two estimates for a fixed test vector h gives

    $$\displaystyle\begin{array}{rcl} g_{j}^{\top }h& \leq &\phi ({x}^{j+1} + h,{x}^{j}) - f({x}^{j}) + f({x}^{j}) -\phi _{ k_{j}}({x}^{j+1},{x}^{j}) {}\\ & =& \phi ({x}^{j+1} + h,{x}^{j}) - f({x}^{j}) + f({x}^{j}) - \Phi _{ k_{j}}({x}^{j+1},{x}^{j}) {}\\ & & +\frac{1} {2}{({x}^{j} - {x}^{j+1})}^{\top }Q_{ j}({x}^{j} - {x}^{j+1}) {}\\ & \leq &\phi ({x}^{j+1} + h,{x}^{j}) - f({x}^{j}) {+\gamma }^{-1}\left (f({x}^{j}) - f({x}^{j+1})\right ) {}\\ & & +\frac{1} {2}{({x}^{j} - {x}^{j+1})}^{\top }Q_{ j}({x}^{j} - {x}^{j+1}) {}\\ & =& \phi ({x}^{j+1} + h,{x}^{j}) - f({x}^{j}) {+\gamma }^{-1}\left (f({x}^{j}) - f({x}^{j+1})\right ) {}\\ & & +\frac{1} {2}{({x}^{j} - {x}^{j+1})}^{\top }(Q_{ j} +\tau _{k_{j}}I)({x}^{j} - {x}^{j+1}) -\frac{\tau _{k_{j}}} {2} \|{x}^{j} - {x{}^{j+1}\|}^{2} {}\\ & \leq &\phi ({x}^{j+1} + h,{x}^{j}) - f({x}^{j}) {+\gamma }^{-1}\left (f({x}^{j}) - f({x}^{j+1})\right ) {}\\ & & +\frac{1} {2}{({x}^{j} - {x}^{j+1})}^{\top }(Q_{ j} +\tau _{k_{j}}I)({x}^{j} - {x}^{j+1}). {}\\ \end{array}$$

    Now fix \(h{^\prime} \in {\mathbb{R}}^{n}\). Plugging \(h = {x}^{j} - {x}^{j+1} + h{^\prime}\) in the above estimate gives

    $$\displaystyle\begin{array}{rcl} \frac{1} {2}\|{x}^{j}-{x}^{j+1}\|_{ Q_{j}+\tau _{k_{j}}I}^{2}+g_{ j}^{\top }h{^\prime} \leq \phi ({x}^{j} + h{^\prime},{x}^{j})-f({x}^{j}){+\gamma }^{-1}\left (f({x}^{j})-f({x}^{j+1})\right ).& & {}\end{array}$$
    (26.28)

    Passing to the limit \(j \in J{^\prime}\) and using, in the order named, \(\|{x}^{j} - {x}^{j+1}\|_{Q_{j}+\tau _{k_{ j}}I}^{2} \rightarrow 0\), g j → 0, \({x}^{j} \rightarrow \bar{ x}\), \(f({x}^{j}) \rightarrow f(\bar{x}) =\phi (\bar{x},\bar{x})\) and \(f({x}^{j}) - f({x}^{j+1}) \rightarrow 0\), we obtain

    $$\displaystyle\begin{array}{rcl} 0 \leq \phi (\bar{x} + h{^\prime},\bar{x}) -\phi (\bar{x},\bar{x}).& & {}\end{array}$$
    (26.29)

    In (26.28) the rightmost term \(f({x}^{j}) - f({x}^{j+1}) \rightarrow 0\) converges by monotonicity, convergence of the leftmost term was shown in part (i), and g j → 0 is the working hypothesis. Now the test vector h′ in (26.29) is arbitrary, which shows \(0 \in \partial \phi (\bar{x},\bar{x})\). By Lemma 26.7 we have \(0 \in \partial _{[\varepsilon ]}f(\bar{x}) \subset \partial _{[\tilde{\varepsilon }]}f(\bar{x})\).

  3. (iii)

    The second, more complicated case is when \(\|g_{j}\| =\| (Q_{j} +\tau _{k_{j}}I)({x}^{j} - {x}^{j+1})\| \geq \mu > 0\) for some μ > 0 and every \(j \in J\). The remainder of this proof will be entirely dedicated to this case.

    We notice first that under this assumption the \(\tau _{k_{j}}\), \(j \in J\), must be unbounded. Indeed, assume on the contrary that the \(\tau _{k_{j}}\), \(j \in J\), are bounded. By boundedness of Q j and boundedness of the serious steps, there exists then an infinite subsequence J′ of J such that Q j , \(\tau _{k_{j}}\), and \({x}^{j} - {x}^{j+1}\) converge respectively to \(\bar{Q}\), \(\bar{\tau }\), and \(\delta \bar{x}\) along \(j \in J{^\prime}\). This implies that the corresponding subsequence of g j converges to \((\bar{Q} +\bar{\tau } I)\delta \bar{x}\), where \(\|(\bar{Q} +\bar{\tau } I)\delta \bar{x}\| \geq \mu > 0\). Similarly, \({({x}^{j} - {x}^{j+1})}^{\top }(Q_{j} +\tau _{k_{j}}I)({x}^{j} - {x}^{j+1}) \rightarrow \delta \bar{ {x}}^{\top }(\bar{Q} +\bar{\tau } I)\delta \bar{x}\). By part (i) of the proof we have \(g_{j}^{\top }({x}^{j} - {x}^{j+1}) =\| {x}^{j} - {x}^{j+1}\|_{Q_{j}+\tau _{k_{ j}}I}^{2} \rightarrow 0\), which means \(\delta \bar{{x}}^{\top }(\bar{Q} +\bar{\tau } I)\delta \bar{x} = 0\). Since \(\bar{Q} +\bar{\tau } I\) is symmetric and \(\bar{Q} +\bar{\tau } I\succeq 0\), we deduce \((\bar{Q} +\bar{\tau } I)\delta \bar{x} = 0\), contradicting \(\|(\bar{Q} +\bar{\tau } I)\delta \bar{x}\| \geq \mu > 0\). This argument proves that the \(\tau _{k_{j}}\), \(j \in J\), are unbounded.

  4. (iv)

    Having shown that the sequence \(\tau _{k_{j}}\), \(j \in J\), is unbounded, we can without loss assume that \(\tau _{k_{j}} \rightarrow \infty \), \(j \in J\), passing to a subsequence if required. Let us now distinguish two types of indices \(j \in J\). We let \({J}^{+}\) be the set of those \(j \in J\) for which the τ-parameter was increased at least once during the jth inner loop. The remaining indices \(j \in {J}^{-}\) are those where the τ-parameter remained unchanged during the jth inner loop. Since the jth inner loop starts at \(\tau _{j}^{\sharp }\) and ends at \(\tau _{k_{j}}\), we have

    $$\displaystyle{{J}^{+} =\{ j \in J:\tau _{ k_{j}} >\tau _{ j}^{\sharp }\}\;\mbox{ and }\;{J}^{-} =\{ j \in J:\tau _{ k_{j}} =\tau _{ j}^{\sharp }\}.}$$

    We claim that the set \({J}^{-}\) must be finite. For suppose \({J}^{-}\) is infinite; then \(\tau _{k_{j}} \rightarrow \infty \), \(j \in {J}^{-}\), and since \(\tau _{k_{j}} =\tau _{ j}^{\sharp }\) on \({J}^{-}\), also \(\tau _{j}^{\sharp } \rightarrow \infty \), \(j \in {J}^{-}\). But this contradicts the rule in step 8 of the algorithm, which forces \(\tau _{j}^{\sharp } \leq T < \infty \). This contradiction shows that \({J}^{-}\) is finite, so \({J}^{+}\) is cofinal in J.

  5. (v)

    Remember that we are still in the case whose discussion started in point (iii). We are now dealing with an infinite subsequence \({J}^{+}\) of J such that \(\tau _{k_{j}} \rightarrow \infty \), \(\|g_{j}\| \geq \mu > 0\), and such that the τ-parameter was increased at least once during the jth inner loop. Suppose this happened for the last time at stage \(k_{j} -\nu _{j}\) for some ν j ≥ 1. Then

    $$\displaystyle\begin{array}{rcl} \tau _{k_{j}} =\tau _{k_{j}-1} =\ldots =\tau _{k_{j}-\nu _{j}+1} = 2\tau _{k_{j}-\nu _{j}}.& & {}\end{array}$$
    (26.30)

    According to step 6 of the algorithm, the increase at counter \(k_{j} -\nu _{j}\) is due to the fact that

    $$\displaystyle\begin{array}{rcl} \rho _{k_{j}-\nu _{j}} <\gamma \;\; \mbox{ and }\;\;\tilde{\rho }_{k_{j}-\nu _{j}} \geq \tilde{\gamma }.& & {}\end{array}$$
    (26.31)

    This case is labelled too bad in step 6 of the algorithm.

  6. (vi)

    Condition (26.31) means that there are infinitely many \(j \in {J}^{+}\) satisfying

    $$\displaystyle\begin{array}{rcl} \rho _{k_{j}-\nu _{j}} = \frac{f({x}^{j}) - f({y}^{k_{j}-\nu _{j}})} {f({x}^{j}) - \Phi _{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})} <\gamma & & {}\\ \end{array}$$

    and

    $$\displaystyle\begin{array}{rcl} \tilde{\rho }_{k_{j}-\nu _{j}} = \frac{f({x}^{j}) - M_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})} {f({x}^{j}) - \Phi _{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})} \geq \tilde{\gamma }.& & {}\\ \end{array}$$

    Notice first that as \(\tau _{k_{j}} \rightarrow \infty \) and \(\tau _{k_{j}} = 2\tau _{k_{j}-\nu _{j}}\), boundedness of the subgradients \(\tilde{g}_{j}\,:=\,(Q_{j} + \frac{1} {2}\tau _{k_{j}}I)({x}^{j} - {y}^{k_{j}-\nu _{j}}) \in \partial \phi _{k_{ j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})\) shows \({y}^{k_{j}-\nu _{j}} \rightarrow \bar{ x}\). Indeed, boundedness of the \(\tilde{g}_{j}\) follows from the subgradient inequality

    $$\displaystyle\begin{array}{rcl}{ ({x}^{j}-{y}^{k_{j}-\nu _{j} })}^{\top }(Q_{ j}+\tau _{k_{j}-\nu _{j}}I)({x}^{j}-{y}^{k_{j}-\nu _{j} })& \leq &\phi _{k_{j}-\nu _{j}}({x}^{j},{x}^{j})-\phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \\ & \leq & f({x}^{j})-m_{ 0}({y}^{k_{j}-\nu _{j} },{x}^{j}) \\ & =& g{({x}^{j})}^{\top }({x}^{j}-{y}^{k_{j}-\nu _{j} }) \\ & \leq &\|g({x}^{j})\|\|{x}^{j}-{y}^{k_{j}-\nu _{j} }\|, {}\end{array}$$
    (26.32)

    where \(m_{0}(\cdot,{x}^{j}) = f({x}^{j}) + g{({x}^{j})}^{\top }(\cdot - {x}^{j})\) is the exactness plane at x j. As \(\tau _{k_{j}} \rightarrow \infty \), we have \(\tau _{k_{j}-\nu _{j}} = \frac{1} {2}\tau _{k_{j}} \rightarrow \infty \), too, so the left-hand side of (26.32) behaves asymptotically like constant times \(\tau _{k_{j}-\nu _{j}}\|{x}^{j} - {y{}^{k_{j}-\nu _{j}}\|}^{2}\). On the other hand the x jΩ are bounded, hence so are the g(x j). The right-hand side therefore behaves asymptotically like constant times \(\|{x}^{j} - {y}^{k_{j}-\nu _{j}}\|\). This shows boundedness of \(\tau _{k_{j}-\nu _{j}}\|{x}^{j} - {y}^{k_{j}-\nu _{j}}\|\), and therefore \({x}^{j} - {y}^{k_{j}-\nu _{j}} \rightarrow 0\), because \(\tau _{k_{j}-\nu _{j}} \rightarrow \infty \).

  7. (vii)

    Recall that \({x}^{j} \rightarrow \bar{ x}\), \(j \in J\). By (vi) we know that \({y}^{k_{j}-\nu _{j}} \rightarrow \bar{ x}\), \(j \in J\). Passing to a subsequence J′ of J, we may assume \(\tilde{g}_{j} \rightarrow \tilde{ g}\) for some \(\tilde{g}\). We show \(\tilde{g} \in \partial \phi (\bar{x},\bar{x}).\)

    For a test vector h and \(j \in J{^\prime}\),

    $$\displaystyle\begin{array}{rcl} \tilde{g}_{j}^{\top }h& \leq &\phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} } + h,{x}^{j}) -\phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \\ & \leq &\phi ({y}^{k_{j}-\nu _{j} } + h,{x}^{j}) -\phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}). {}\end{array}$$
    (26.33)

    Using the fact that \(\tilde{\rho }_{k_{j}-\nu _{j}} \geq \tilde{\gamma }\), we have

    $$\displaystyle{f({x}^{j}) - \Phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) {\leq \tilde{\gamma }}^{-1}\left (f({x}^{j}) - M_{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j})\right ).}$$

    Adding \(\frac{1} {2}{({y}^{k_{j}-\nu _{j}} - {x}^{j})}^{\top }Q_{ j}({y}^{k_{j}-\nu _{j}} - {x}^{j})\) on both sides gives

    $$\displaystyle\begin{array}{rcl} & & f({x}^{j}) -\phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) {}\\ & & {\leq \tilde{\gamma }}^{-1}\left (f({x}^{j}) - M_{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j})\right ) + \frac{1} {2}{({y}^{k_{j}-\nu _{j}} - {x}^{j})}^{\top }Q_{ j}({y}^{k_{j}-\nu _{j}} - {x}^{j}). {}\\ \end{array}$$

    Combining this and estimate (26.33) gives

    $$\displaystyle\begin{array}{rcl} \tilde{g}_{j}^{\top }h& \leq &\phi ({y}^{k_{j}-\nu _{j} } + h,{x}^{j}) - f({x}^{j}) {+\tilde{\gamma } }^{-1}\left (f({x}^{j}) - M_{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j})\right ) \\ & & +\frac{1} {2}{({y}^{k_{j}-\nu _{j}} - {x}^{j})}^{\top }Q_{ j}({y}^{k_{j}-\nu _{j}} - {x}^{j}). {}\end{array}$$
    (26.34)

    As we have seen \({y}^{k_{j}-\nu _{j}} - {x}^{j} \rightarrow 0\), hence the rightmost term in (26.34) converges to 0 by boundedness of Q j . Moreover, we claim that \(\lim f({x}^{j}) - M_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j}) = 0\), so the term \({\tilde{\gamma }}^{-1}(\ldots )\) on the right-hand side of (26.34) converges to 0. Indeed, to see this claim, notice first that it suffices to show \(f({x}^{j}) - m_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j}) \rightarrow 0\), because the second-order term converges to 0. Since \(m_{k_{j}-\nu _{j}}(\cdot,{x}^{j})\) is a cutting plane at x j, we have \(m_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j}) \leq f({y}^{k_{j}-\nu _{j}})\) by definition of the downshift. So it suffices to show \(\liminf m_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j}) \geq f(\bar{x})\). Now this follows from the definition of the downshift s j at \({y}^{k_{j}-\nu _{j}}\) with regard to x j. Recall that for the tangent \(t_{k_{j}-\nu _{j}}(\cdot )\) at \({y}^{k_{j}-\nu _{j}}\), approximate subgradient \(\tilde{g}_{j}\), and serious iterate x j, we have

    $$\displaystyle{s_{j} = [t_{k_{j}-\nu _{j}}({x}^{j}) - f({x}^{j})]_{ +} + c\|{y}^{k_{j}-\nu _{j} } - {x{}^{j}\|}^{2}.}$$

    We can clearly concentrate on proving \(t_{k_{j}-\nu _{j}}({x}^{j}) - f({x}^{j}) \rightarrow 0\). Now \(t_{k_{j}-\nu _{j}}({x}^{j}) - f({x}^{j}) = f({y}^{k_{j}-\nu _{j}}) - f({x}^{j}) +\tilde{ g}_{j}^{\top }({x}^{j} - {y}^{k_{j}-\nu _{j}})\), and since \({y}^{k_{j}-\nu _{j}} \rightarrow \bar{ x}\), \({x}^{j} \rightarrow \bar{ x}\), and the \(\tilde{g}_{j}\) are bounded, our claim follows.

    Going back to (26.34) with the information \(\tilde{g}_{j}^{\top }h \rightarrow \tilde{ {g}}^{\top }h\), it remains to prove \(\limsup \phi ({y}^{k_{j}-\nu _{j}} + h,{x}^{j}) \leq \phi (\bar{x} + h,\bar{x})\). Indeed, once this is proved, passing to the limit in (26.34) shows \(\tilde{{g}}^{\top }h \leq \phi (\bar{x} + h,\bar{x}) - f(\bar{x}) =\phi (\bar{x} + h,\bar{x}) -\phi (\bar{x},\bar{x})\). This proves \(\tilde{g} \in \partial \phi (\bar{x},\bar{x})\), and then \(\tilde{g} \in \partial _{[\varepsilon ]}f(\bar{x})\) by Lemma 26.7.

    What remains to be shown is obviously joint upper semicontinuity of ϕ at \((\bar{x} + h,\bar{x})\), and this follows from Lemma 26.7; hence our claim \(\tilde{g} \in \partial _{[\varepsilon ]}f(\bar{x})\) is proved.

  8. (viii)

    Let \(\eta \,:=\,\text{dist}\left (0,\partial \phi (\bar{x},\bar{x})\right )\). Then \(\|\tilde{g}\| \geq \eta\) by (vii) above. Let us fix 0 < ζ < 1; then, as \(\tilde{g}_{j} \rightarrow \tilde{ g}\), we have \(\|\tilde{g}_{j}\| \geq (1-\zeta )\eta\) for \(j \in J{^\prime}\) large enough.

    Now, assuming first \([\ldots ]_{+} > 0\) in the downshift, we have

    $$\displaystyle\begin{array}{rcl} m_{k_{j}-\nu _{j}}(\cdot,{x}^{j})& =& f({y}^{k_{j}-\nu _{j} }) +\tilde{ g}_{j}^{\top }(\cdot - {y}^{k_{j}-\nu _{j} }) - s_{j} {}\\ & =& f({y}^{k_{j}-\nu _{j} })+\tilde{g}_{j}^{\top }(\cdot -{y}^{k_{j}-\nu _{j} })-c\|{y}^{k_{j}-\nu _{j} }-{x{}^{j}\|}^{2}-t_{ k_{j}-\nu _{j}}({x}^{j})+f({x}^{j}) {}\\ & =& f({x}^{j}) +\tilde{ g}_{ j}^{\top }(\cdot - {x}^{j}) - c\|{y}^{k_{j}-\nu _{j} } - {x{}^{j}\|}^{2}, {}\\ \end{array}$$

    for \(\tilde{g}_{j} \in \partial _{[\varepsilon ]}f({y}^{k_{j}-\nu _{j}})\) as above. Pick \(g_{j} \in \partial f({y}^{k_{j}-\nu _{j}})\) such that \(\|g_{j} -\tilde{ g}_{j}\| \leq \varepsilon\). Then

    $$\displaystyle\begin{array}{rcl} f({y}^{k_{j}-\nu _{j} }) - m_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j})& =& f({y}^{k_{j}-\nu _{j} }) - f({x}^{j}) -\tilde{ g}_{ j}^{\top }({y}^{k_{j}-\nu _{j} } - {x}^{j}) {}\\ & & +c\|{y}^{k_{j}-\nu _{j} } - {x{}^{j}\|}^{2} {}\\ & =& f({y}^{k_{j}-\nu _{j} }) - f({x}^{j}) - g_{ j}^{\top }({y}^{k_{j}-\nu _{j} } - {x}^{j}) {}\\ & & +(\tilde{g}_{j} - g_{j})({y}^{k_{j}-\nu _{j} } - {x}^{j}) + c\|{y}^{k_{j}-\nu _{j} } - {x{}^{j}\|}^{2}. {}\\ \end{array}$$

    Since f is ɛ′-convex, we have \(g_{j}^{\top }({x}^{j} - {y}^{k_{j}-\nu _{j}}) \leq f({x}^{j}) - f({y}^{k_{j}-\nu _{j}}) +\varepsilon {^\prime}\|{x}^{j} - {y}^{k_{j}-\nu _{j}}\|\). Substituting this we get

    $$\displaystyle\begin{array}{rcl} f({y}^{k_{j}-\nu _{j} }) - m_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \leq (\varepsilon {^\prime}+\varepsilon )\|{y}^{k_{j}-\nu _{j} } - {x}^{j}\| + c\|{y}^{k_{j}-\nu _{j} } - {x{}^{j}\|}^{2}.& & {}\end{array}$$
    (26.35)

    In the case \([\ldots ]_{+} = 0\) an even better estimate is obtained, so that (26.35) covers both cases. Subtracting a term \(\frac{1} {2}{({y}^{k_{j}-\nu _{j}} - {x}^{j})}^{\top }Q_{ j}({y}^{k_{j}-\nu _{j}} - {x}^{j})\) on both sides of (26.35) and using \({y}^{k_{j}-\nu _{j}} - {x}^{j} \rightarrow 0\), we get

    $$\displaystyle{f({y}^{k_{j}-\nu _{j} }) - M_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \leq (\varepsilon {^\prime} +\varepsilon +\nu _{ j})\|{y}^{k_{j}-\nu _{j} } - {x}^{j}\|,}$$

    where ν j → 0. In consequence

    $$\displaystyle\begin{array}{rcl} f({y}^{k_{j}-\nu _{j} }) - M_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \leq (1+\zeta )(\varepsilon {^\prime}+\varepsilon )\|{y}^{k_{j}-\nu _{j} } - {x}^{j}\|\qquad & & {}\end{array}$$
    (26.36)

    for j large enough. Recall that \(\tilde{g}_{j} = (Q_{j} + \frac{1} {2}\tau _{k_{j}}I)({x}^{j} - {y}^{k_{j}-\nu _{j}}) \in \partial \phi _{k_{ j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})\) by (26.8) and (26.30). Hence by the subgradient inequality

    $$\displaystyle\begin{array}{rcl} \tilde{g}_{j}^{\top }({x}^{j} - {y}^{k_{j}-\nu _{j} }) \leq \phi _{k_{j}-\nu _{j}}({x}^{j},{x}^{j}) -\phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}).& & {}\\ \end{array}$$

    Subtracting a term \(\frac{1} {2}{({x}^{j} - {y}^{k_{j}-\nu _{j}})}^{\top }Q_{ j}({x}^{j} - {y}^{k_{j}-\nu _{j}})\) from both sides gives

    $$\displaystyle\begin{array}{rcl} \frac{1} {2}{({x}^{j}-{y}^{k_{j}-\nu _{j}})}^{\top }Q_{ j}({x}^{j}-{y}^{k_{j}-\nu _{j}}) + \frac{1} {2}\tau _{k_{j}}\|{x}^{j}-{y{}^{k_{j}-\nu _{j}}\|}^{2} \leq f({x}^{j})-\Phi _{k_{ j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j}).& & {}\end{array}$$
    (26.37)

    As \(\tau _{k_{j}} \rightarrow \infty \), we have

    $$\displaystyle\begin{array}{rcl} (1-\zeta )\frac{1} {2}\tau _{k_{j}}\|{x}^{j} - {y}^{k_{j}-\nu _{j}}\| \leq \|\tilde{ g}_{j}\| \leq (1+\zeta )\frac{1} {2}\tau _{k_{j}}\|{x}^{j} - {y}^{k_{j}-\nu _{j}}\|& & {}\end{array}$$
    (26.38)

    and

    $$\displaystyle\begin{array}{rcl} \frac{1} {2}{({x}^{j}-{y}^{k_{j}-\nu _{j}})}^{\top }Q_{ j}({x}^{j}-{y}^{k_{j}-\nu _{j}})+\frac{1} {2}\tau _{k_{j}}\|{x}^{j}-{y{}^{k_{j}-\nu _{j}}\|}^{2} \geq (1-\zeta )\frac{1} {2}\tau _{k_{j}}\|{x}^{j}-{y{}^{k_{j}-\nu _{j}}\|}^{2}& & {}\end{array}$$
    (26.39)

    both for j large enough. Therefore, plugging (26.38) and (26.39) into (26.37) gives

    $$\displaystyle{f({x}^{j}) - \Phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \geq \frac{1-\zeta } {1+\zeta }\|\tilde{g}_{j}\|\|{x}^{j} - {y}^{k_{j}-\nu _{j}}\|}$$

    for j large enough. Since \(\|\tilde{g}_{j}\| \geq (1-\zeta )\eta\) for j large enough, we deduce

    $$\displaystyle\begin{array}{rcl} f({x}^{j}) - \Phi _{ k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \geq \frac{{(1-\zeta )}^{2}} {1+\zeta } \eta \|{x}^{j} - {y}^{k_{j}-\nu _{j}}\|.& & {}\end{array}$$
    (26.40)
  9. (ix)

    Combining (26.36) and (26.40) gives the estimate

    $$\displaystyle\begin{array}{rcl} \tilde{\rho }_{k_{j}-\nu _{j}}& =& \rho _{k_{j}-\nu _{j}} + \frac{f({y}^{k_{j}-\nu _{j}}) - M_{k_{ j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})} {f({x}^{j}) - \Phi _{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j}},{x}^{j})} \\ & \leq &\rho _{k_{j}-\nu _{j}} + \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )\|{y}^{k_{j}-\nu _{j}} - {x}^{j}\|} {{(1-\zeta )}^{2}\eta \|{y}^{k_{j}-\nu _{j}} - {x}^{j}\|}. {}\end{array}$$
    (26.41)

    This proves

    $$\displaystyle{\eta \leq \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}(\tilde{\gamma }-\gamma )}.}$$

    For suppose we had \(\eta > \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}(\tilde{\gamma }-\gamma )}\); then \(\frac{{(1+\zeta )}^{2}(\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}\eta } <\tilde{\gamma } -\gamma\), which in view of \(\rho _{k_{j}-\nu _{j}} <\gamma\) would give \(\tilde{\rho }_{k_{j}-\nu _{j}} \leq \rho _{k_{j}-\nu _{j}} +\tilde{\gamma } -\gamma <\tilde{\gamma }\) for all j, contradicting \(\tilde{\rho }_{k_{j}-\nu _{j}} \geq \tilde{\gamma }\) for infinitely many \(j \in J\).

    Since ζ in the above discussion was arbitrary, we have shown \(\eta \leq \frac{\varepsilon {^\prime}+\varepsilon } {\tilde{\gamma }-\gamma }\). Recall that \(\eta = \text{dist}\left (0,\partial \phi (\bar{x},\bar{x})\right )\) and that \(\partial \phi (\bar{x},\bar{x}) \subset \partial _{[\varepsilon ]}f(\bar{x})\) by Lemma 26.7. We therefore have shown \(0 \in \partial _{[\varepsilon +\eta ]}f(\bar{x}) \subset \partial _{[\tilde{\varepsilon }]}f(\bar{x})\), where \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\). This is what is claimed.

Corollary 26.13.

Suppose \(\Omega =\{ x \in {\mathbb{R}}^{n}: f(x) \leq f({x}^{1})\}\) is bounded and f is lower C 1 . Let approximate subgradients be drawn from ∂ [ɛ] f(y), whereas function values are exact. Then every accumulation point \(\bar{x}\) of the sequence of serious iterates x j satisfies \(0 \in \partial _{[\alpha \varepsilon ]}f(\bar{x})\) , where \(\alpha = 1 + {(\tilde{\gamma }-\gamma )}^{-1}\) .

Remark 26.14.

At first glance one might consider the class of lower C 1 functions used in Corollary 26.13 as too restrictive to offer sufficient scope. This misapprehension might be aggravated, or even induced, by the fact that lower C 1 functions are approximately convex [14, 34], an unfortunate nomenclature which erroneously suggests something close to a convex function. We therefore stress that lower C 1 is a large class which includes all examples we have so far encountered in practice. Indeed, applications are as a rule even lower C 2, or amenable in the sense of Rockafellar [45], a much smaller class, yet widely accepted as covering all applications of interest.

Recent approaches to nonconvex nonsmooth optimization like [21, 33, 47] all work with composite (and therefore lower C 2) functions. This is in contrast with our own approach [19, 20, 40, 42], which works for lower C 1 and is currently the only one I am aware of that has the technical machinery to go beyond lower C 2. On second glance one will therefore argue that it is rather the class of lower C 2 functions which does not offer sufficient scope to justify the development of a new theory: the chapter on nonsmooth composite convex functions f = g ∘ F in [46] covers this class nicely and leaves little space for new contributions, and one can in any case go beyond it and work with lower C 1 functions.

26.7 Extension to Inexact Values

In this section we discuss what happens when we have not only inexact subgradients but also inexact function values. In the previous sections we assumed that for every approximate subgradient g a of f at x, there exists an exact subgradient g ∈ ∂ f(x) such that \(\|g_{a} - g\| \leq \varepsilon\). Similarly, we will assume that approximate function values f a (x) satisfy \(\vert f_{a}(x) - f(x)\vert \leq \bar{\varepsilon }\) for a fixed error tolerance \(\bar{\varepsilon }\). We do not assume any link between ɛ and \(\bar{\varepsilon }\).

Let us notice the following fundamental difference between the convex case, where it is often reasonable to assume f a ≤ f (see, e.g., [30, 31]), and the nonconvex case. Suppose f is convex, x is the current iterate, and an approximate value \(f(x)-\bar{\varepsilon }\leq f_{a}(x) \leq f(x)\) is known. Suppose y k is a null step, so that we draw an approximate tangent plane \(t_{k}(\cdot ) = f_{a}({y}^{k}) + g_{k}^{\top }(\cdot - {y}^{k})\) at y k with respect to \(g_{k} \in \partial _{[\varepsilon ]}f({y}^{k})\). If we follow [30, 31], then t k ( ⋅), while not a support plane, is still an affine minorant of f. It may then happen that \(t_{k}(x) = f_{a}({y}^{k}) + g_{k}^{\top }(x - {y}^{k}) > f_{a}(x)\), because \(f_{a}(x),f_{a}({y}^{k})\) are approximations only. The approximate cutting plane then gives us reliable information that the true value f(x) satisfies \(f(x) \geq t_{k}(x) > f_{a}(x)\). We shall say that we can trust the value t k (x) > f a (x).

What should we do if we find a value t k (x) which we can trust and which reveals our estimate f a (x) as too low? Should we correct f a (x) and replace it by the better estimate now available? If we do this we create trouble. Namely, we have previously rejected trial steps y k during the inner loop at x based on the incorrect information f a (x). Some of these steps might have been acceptable, had we used t k (x) instead. But on the other hand, x was itself accepted as a serious step in the inner loop at the previous iterate \({x}^{-}\) because f a (x) was sufficiently below \(f_{a}({x}^{-})\). If we correct the approximate value at x, then acceptance of x may become unsound as well. In short, correcting values as soon as better estimates arrive is not a good idea, because we might be forced to go repeatedly back all the way through the history of our algorithm.

In order to avoid this backtracking, Kiwiel [30] proposes the following original idea. If f a (x), being too low, still allows progress in the sense that x + with \(f_{a}({x}^{+}) < f_{a}(x)\) can be found, then why waste time and correct the value f a (x)? After all, there is still progress! On the other hand, if the underestimation f a (x) is so severe that the algorithm will stop, then we should be sure that no further decrease within the error tolerances \(\bar{\varepsilon },\varepsilon\) is possible. Namely, if this is the case, then we can stop in all conscience. To check this, Kiwiel progressively relaxes proximity control in the inner loop, until it becomes clear that the model built from all possible approximate cutting planes does not itself allow descent below f a (x) and, therefore, does not allow descent of more than \(\bar{\varepsilon }\) below f(x).

The situation outlined is heavily based on convexity and does not appear to carry over to nonconvex problems. The principal difficulty is that without convexity we cannot trust values \(t_{y,g}(x) > f_{a}(x)\) even in the case of exact tangent planes, g ∈ ∂ f(y). We know that tangents have to be downshifted, and without the exact knowledge of f(x), the only available reference value to organize the downshift is f a (x). Naturally, as soon as we downshift with reference to f a (x), cutting planes m y,g ( ⋅,x) satisfying \(m_{y,g}(x,x) > f_{a}(x)\) can no longer occur. This removes one of the difficulties. However, it creates, as we shall see, a new one.
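
To make the downshift relative to f a (x) concrete, the following minimal Python sketch builds such a cutting plane. The function and variable names (fa_x, fa_y, g_a, the constant c) are mine, not the paper's; the formulas are the approximate tangent and the downshift s = [t(x) − f a (x)] + + c‖y − x‖ 2 used throughout.

```python
import numpy as np

def cutting_plane(x, y, fa_x, fa_y, g_a, c=1e-8):
    """Downshifted cutting plane m(., x) at the trial point y.

    fa_x, fa_y : approximate values f_a(x), f_a(y)
    g_a        : approximate subgradient at y
    c          : small positive downshift constant
    By construction m(x) <= fa_x - c*||y - x||^2, so a plane lying above the
    inexact reference value at x cannot occur."""
    x, y, g_a = map(np.asarray, (x, y, g_a))

    def t(z):
        # approximate tangent t(.) = f_a(y) + g_a^T (. - y)
        return fa_y + g_a @ (np.asarray(z) - y)

    # downshift s = [t(x) - f_a(x)]_+ + c*||y - x||^2, relative to f_a(x)
    s = max(t(x) - fa_x, 0.0) + c * np.linalg.norm(y - x) ** 2
    return lambda z: t(z) - s
```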

In order to proceed with inexact function values, we will need the following property of the cutting plane \(m_{k}(\cdot,x)\,:=\,t_{k}(\cdot ) - s_{k}\) at null step y k and approximate subgradient g k ∈ ∂ [ɛ] f(y k). We need to find \(\tilde{\varepsilon }> 0\) such that \(f_{a}({y}^{k}) \leq m_{k}({y}^{k},x) +\tilde{\varepsilon }\| x - {y}^{k}\|\). More explicitly, this requires

$$\displaystyle{f_{a}({y}^{k}) \leq f_{ a}(x) + g_{k}^{\top }({y}^{k} - x) +\tilde{\varepsilon }\| x - {y}^{k}\|.}$$

If f is ɛ′-convex, then

$$\displaystyle\begin{array}{rcl} f({y}^{k})& \leq & f(x) + {g}^{\top }({y}^{k} - x) +\varepsilon {^\prime}\|x - {y}^{k}\| {}\\ & \leq & f(x) + g_{k}^{\top }({y}^{k} - x) + (\varepsilon {^\prime}+\varepsilon )\|x - {y}^{k}\| {}\\ \end{array}$$

for g ∈ ∂ f(y k) and \(\|g - g_{k}\| \leq \varepsilon\). That means

$$\displaystyle{f({y}^{k}) - (f(x) - f_{ a}(x)) \leq f_{a}(x) + g_{k}^{\top }({y}^{k} - x) + (\varepsilon +\varepsilon {^\prime})\|x - {y}^{k}\|.}$$

So what we need in addition is something like

$$\displaystyle{f_{a}({y}^{k}) \leq f({y}^{k}) - (f(x) - f_{ a}(x)) +\varepsilon {^\prime \prime}\|x - {y}^{k}\|,}$$

because then we get the desired relation with \(\tilde{\varepsilon }=\varepsilon +\varepsilon {^\prime} +\varepsilon {^\prime \prime}\). The condition can still be slightly relaxed to make it more useful in practice. The axiom we need is that there exist \(\delta _{k} \rightarrow {0}^{+}\) such that

$$\displaystyle\begin{array}{rcl} f(x) - f_{a}(x) \leq f({y}^{k}) - f_{ a}({y}^{k}) + (\varepsilon {^\prime \prime} +\delta _{ k})\|x - {y}^{k}\|& &{}\end{array}$$
(26.42)

for every \(k \in \mathbb{N}\). Put differently, as y k → x, the error we make at y k by underestimating f(y k) by \(f_{a}({y}^{k})\) is at least as large as the corresponding underestimation error at x, up to a term proportional to \(\|x - {y}^{k}\|\). The case of exact values f = f a corresponds to \(\varepsilon {^\prime \prime} = 0,\delta _{k} = 0\).
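
When an exact oracle happens to be available for testing purposes, axiom (26.42) can be monitored numerically. The following sketch is purely diagnostic and all names are hypothetical; it checks (26.42) with δ k = 0 on a finite sample of trial points.

```python
import numpy as np

def violations_of_2642(x, trial_points, f, f_a, eps2):
    """Return the trial points y at which (26.42) with delta_k = 0 fails, i.e.
    f(x) - f_a(x) > f(y) - f_a(y) + eps2 * ||x - y||."""
    x = np.asarray(x)
    err_x = f(x) - f_a(x)                      # underestimation error at x
    bad = []
    for y in trial_points:
        y = np.asarray(y)
        err_y = f(y) - f_a(y)                  # underestimation error at y
        if err_x > err_y + eps2 * np.linalg.norm(x - y):
            bad.append(y)
    return bad
```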

Remark 26.15.

As f is continuous at x, condition (26.42) implies upper semi-continuity of f a at serious iterates, i.e., \(\limsup f_{a}({y}^{k}) \leq f_{a}(x)\).

We are now ready to modify our algorithm and then run through the proofs of Lemmas 26.9 and  26.11 and Theorem 26.12 and see what changes need to be made to account for the new situation. As far as the algorithm is concerned, the changes are easy. We replace f(y k) and f(x) by \(f_{a}({y}^{k})\) and f a (x). The rest of the procedure is the same.
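
For illustration, the acceptance and proximity-control decision of steps 5 and 6 with inexact values then reads as in the following schematic Python sketch. The model values Φ k (y k ,x) and M k (y k ,x) are assumed to be already computed, γ < γ̃ are the fixed acceptance parameters, and the doubling of τ follows (26.30); the names are mine and this is a sketch of the decision logic only, not of the full inner loop.

```python
def acceptance_and_tau_update(fa_x, fa_y, Phi_k_y, M_k_y, tau_k, gamma, gamma_tilde):
    """One acceptance/proximity-control decision with inexact values.

    fa_x, fa_y : approximate values f_a(x), f_a(y^k)
    Phi_k_y    : Phi_k(y^k, x), working model value plus quadratic term
    M_k_y      : M_k(y^k, x), new downshifted plane plus quadratic term
    Returns (accept, new_tau); assumes the predicted decrease is positive."""
    pred = fa_x - Phi_k_y                 # predicted decrease of the model
    rho = (fa_x - fa_y) / pred            # achieved versus predicted decrease
    rho_tilde = (fa_x - M_k_y) / pred     # agreement between model and new plane
    if rho >= gamma:                      # step 5: accept y^k as the new serious iterate
        return True, tau_k
    if rho_tilde >= gamma_tilde:          # step 6, 'too bad': double tau, cf. (26.30)
        return False, 2.0 * tau_k
    return False, tau_k                   # null step: keep tau, enrich the model instead
```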

We consider the same convex envelope function ϕ( ⋅,x) defined in (26.13). We have the following.

Lemma 26.16.

The upper envelope model satisfies ϕ(x,x) = f a (x), ϕ k ≤ϕ. ϕ is jointly upper \(2\bar{\varepsilon }\) -semicontinuous, and   \(\partial \phi (x,x)\ \subset \ \partial _{[\varepsilon ]}f(x)\ \subset \ \partial _{2\bar{\varepsilon }}\phi (x,x)\) , where \(\partial _{2\bar{\varepsilon }}\phi (x,x)\) is the \(2\bar{\varepsilon }\) -subdifferential of ϕ(⋅,x) at x in the usual convex sense.

Proof.

 

  1. (1)

    Any cutting plane m z,g ( ⋅,x) satisfies \(m_{z,g}(x,x) \leq f_{a}(x) - c\|x - {z\|}^{2}\). This shows ϕ(x,x) ≤ f a (x), and if we take z = x, we get equality ϕ(x,x) = f a (x).

  2. (2)

    We prove \(\partial _{[\varepsilon ]}f(x) \subset \partial _{2\bar{\varepsilon }}\phi (x,x)\). Let g ∈ ∂ f(x) be a limiting subgradient, and choose y k → x, where f is differentiable at y k with \(g_{k} = \nabla f({y}^{k}) \in \partial f({y}^{k})\) such that g k → g. Let g a be an approximate subgradient such that \(\|g - g_{a}\| \leq \varepsilon\). We have to prove \(g_{a} \in \partial _{2\bar{\varepsilon }}\phi (x,x)\). Putting \(g_{a,k}\,:=\,g_{k} + g_{a} - g \in \partial _{[\varepsilon ]}f({y}^{k})\) we have \(g_{a,k} \rightarrow g_{a}\). Let m k ( ⋅,x) be the cutting plane drawn at y k with approximate subgradient g a,k . That is, \(m_{k}(\cdot,x) = m_{{y}^{k},g_{a,k}}(\cdot,x)\). Then

    $$\displaystyle{m_{k}(y,x) = f_{a}({y}^{k}) + g_{ a,k}^{\top }(y - {y}^{k}) - s_{ k},}$$

    where \(s_{k} = [t_{k}(x) - f_{a}(x)]_{+} + c\|x - {y{}^{k}\|}^{2}\) is the downshift and where t k ( ⋅) is the approximate tangent at y k with respect to g a,k . There are two cases, \(s_{k} = c\|x - {y{}^{k}\|}^{2}\) and \(s_{k} = t_{k}(x) - f_{a}(x) + c\|x - {y{}^{k}\|}^{2}\), according to whether \([\ldots ]_{+} = 0\) or \([\ldots ]_{+} > 0\). Let us start with the case \(t_{k}(x) > f_{a}(x)\). Then

    $$\displaystyle{s_{k} = f_{a}({y}^{k}) + g_{ a,k}^{\top }(x - {y}^{k}) - f_{a}(x) + c\|x - {y{}^{k}\|}^{2}}$$

    and

    $$\displaystyle{m_{k}(y,x) = f_{a}({y}^{k})+g_{ a,k}^{\top }(y-{y}^{k})-f_{ a}({y}^{k})-g_{ a,k}^{\top }(x-{y}^{k})+f_{ a}(x)-c\|x-{y{}^{k}\|}^{2}.}$$

    Therefore

    $$\displaystyle{\phi (y,x) -\phi (x,x) \geq m_{k}(y,x) - f_{ a}(x) = g_{a,k}^{\top }(y - x) - c\|x - {y{}^{k}\|}^{2}.}$$

    Passing to the limit k →∞ proves g a ∈ ∂ ϕ(x,x), so in this case a stronger statement holds.

    Let us next discuss the case where \(t_{k}(x) \leq f_{a}(x)\), so that \(s_{k} = c\|x - {y{}^{k}\|}^{2}\). Then

    $$\displaystyle{m_{k}(y,x) = f_{a}({y}^{k}) + g_{ a,k}^{\top }(y - {y}^{k}) - c\|x - {y{}^{k}\|}^{2}.}$$

    Therefore

    $$\displaystyle\begin{array}{rcl} \phi (y,x) -\phi (x,x)& \geq & m_{k}(y,x) - f_{ a}(x) {}\\ & =& f_{a}({y}^{k}) - f_{ a}(x) + g_{a,k}^{\top }(y - {y}^{k}) - c\|x - {y{}^{k}\|}^{2} {}\\ & =& f_{a}({y}^{k}) - f_{ a}(x) + g_{a,k}^{\top }(x - {y}^{k}) - c\|x - {y{}^{k}\|}^{2} + g_{ a,k}^{\top }(y - x). {}\\ \end{array}$$

    Put \(\zeta _{k}\,:=\,g_{a,k}^{\top }(x - {y}^{k}) - c\|x - {y{}^{k}\|}^{2} + {(g_{a,k} - g_{a})}^{\top }(y - x)\) then

    $$\displaystyle{\phi (y,x) -\phi (x,x) \geq f_{a}({y}^{k}) - f_{ a}(x) +\zeta _{k} + g_{a}^{\top }(y - x).}$$

    Notice that \(\lim \zeta _{k} = 0\), because g a,k → g a and y k → x. Let F a (x) := liminf k f a (y k); then we obtain

    $$\displaystyle{\phi (y,x) -\phi (x,x) \geq F_{a}(x) - f_{a}(x) + g_{a}^{\top }(y - x).}$$

    Putting \(\varepsilon (x)\,:=\,[f_{a}(x) - F_{a}(x)]_{+}\), we therefore have shown

    $$\displaystyle{\phi (y,x) -\phi (x,x) \geq -\varepsilon (x) + g_{a}^{\top }(y - x),}$$

    which means g a ɛ(x) ϕ(x,x). Since approximate values f a are within \(\bar{\varepsilon }\) of exact values f, we have \(\vert f_{a}(x) - F_{a}(x)\vert \leq 2\bar{\varepsilon }\), hence \(\varepsilon (x) \leq 2\bar{\varepsilon }\). That shows \(g_{a} \in \partial _{\varepsilon (x)}\phi (x,x) \subset \partial _{2\bar{\varepsilon }}\phi (x,x)\).

  3. (3)

    The proof of \(\partial \phi (x,x) \subset \partial _{[\varepsilon ]}f(x)\) remains the same, after replacing f(x) by f a (x).

  4. (4)

    If a sequence of planes m r ( ⋅), \(r \in \mathbb{N}\), contributes to the envelope function ϕ( ⋅,x) and if m r ( ⋅) → m( ⋅) in the pointwise sense, then m( ⋅) also contributes to ϕ( ⋅,x), because the graph of ϕ( ⋅,x) is closed. On the other hand, we may expect discontinuities as x j x. We obtain \(\limsup _{j\rightarrow \infty }\phi (y_{j},x_{j}) \leq \phi (y,x)+\bar{\varepsilon }\) for y j y, x j x.

Remark 26.17.

If approximate function values are underestimations, f a f, as is often the case, then \(\vert F_{a} - f_{a}\vert \leq \bar{\varepsilon }\) and the result holds with \(\partial \phi (x,x) \subset \partial _{[\varepsilon ]}f(x) \subset \partial _{\bar{\varepsilon }}\phi (x,x)\).

Corollary 26.18.

Under the hypotheses of Lemma  26.16 , if x is a point of continuity of f a , then \(\partial \phi (x,x) = \partial _{[\varepsilon ]}f(x)\) and ϕ is jointly upper semicontinuous at (x,x).

Proof.

Indeed, as follows from part (2) of the proof above, for a point of continuity x of f a , we have ɛ(x) = 0. ■

Lemma 26.19.

Suppose the inner loop at serious iterate x turns forever and τ k →∞. Suppose f is ɛ′-convex on a set containing all y k , k ≥ k 0 , and let (26.42) be satisfied. Then \(0 \in \partial _{[\tilde{\varepsilon }]}f(x)\) , where \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime \prime} +\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\) .

Proof.

We go through the proof of Lemma 26.9 and indicate the changes caused by using approximate values \(f_{a}({y}^{k})\), f a (x). Part (ii) remains the same, except that ϕ(x,x) = f a (x). The exactness subgradient still satisfies g(x) ∈ ∂ [ɛ] f(x). Part (iii) leading to formula (26.17) remains the same with f a (x) instead of f(x). Part (iv) remains the same, and we obtain the analogue of (26.18) with f(x) replaced by f a (x).

Substantial changes occur in part (v) of the proof leading to formula (26.19). Indeed, consider without loss the case where \(t_{k}(x) > f_{a}(x)\). Then

$$\displaystyle\begin{array}{rcl} m_{k}(y,x)& =& f_{a}({y}^{k}) + g_{\varepsilon k}^{\top }(y - {y}^{k}) - s_{ k} {}\\ & =& f_{a}(x) + g_{\varepsilon k}^{\top }(y - x) - c\|x - {y{}^{k}\|}^{2}, {}\\ \end{array}$$

as in the proof of Lemma 26.9, and therefore

$$\displaystyle\begin{array}{rcl} f_{a}({y}^{k})-m_{ k}({y}^{k},x)& =& f_{ a}({y}^{k})-f_{ a}(x)-g_{k}^{\top }({y}^{k} - x)+{(g_{\varepsilon k}-g_{k})}^{\top }(x-{y}^{k})+c\|x - {y{}^{k}\|}^{2}.{}\\ \end{array}$$

Since f is ɛ′-convex, we have \(g_{k}^{\top }(x - {y}^{k}) \leq f(x) - f({y}^{k}) +\varepsilon {^\prime}\|x - {y}^{k}\|\). Hence

$$\displaystyle{f_{a}({y}^{k}) - m_{ k}({y}^{k},x) \leq f(x) - f_{ a}(x) -\left (f({y}^{k}) - f_{ a}({y}^{k})\right ) + (\varepsilon {^\prime} +\varepsilon +\nu _{ k})\|x - {y}^{k}\|,}$$

where ν k → 0. Now we use axiom (26.42), which gives

$$\displaystyle{f_{a}({y}^{k}) - m_{ k}({y}^{k},x) \leq (\varepsilon {^\prime \prime} +\varepsilon {^\prime} +\varepsilon +\delta _{ k} +\nu _{k})\|x - {y}^{k}\|,}$$

for \(\delta _{k},\nu _{k} \rightarrow 0\). Subtracting the usual quadratic expression on both sides gives \(f_{a}({y}^{k}) - M_{k}({y}^{k},x) \leq (\varepsilon {^\prime \prime} +\varepsilon {^\prime} +\varepsilon +\delta _{k} +\tilde{\nu } _{k})\|x - {y}^{k}\|\) with \(\delta _{k},\tilde{\nu }_{k} \rightarrow 0\). Going back with this estimation to the expansion \(\tilde{\rho }_{k} \leq \rho _{k} + \frac{\varepsilon {^\prime \prime}+\varepsilon {^\prime}+\varepsilon } {\eta }\) shows \(\eta < \frac{\varepsilon {^\prime \prime}+\varepsilon {^\prime}+\varepsilon } {\tilde{\gamma }-\gamma }\) as in the proof of Lemma 26.9, where η = dist(0,∂ ϕ(x,x)). Since ∂ ϕ(x,x) ⊂ [ɛ] f(x) by Lemma 26.16, we have 0 ∈ [ɛ+η] f(x). This proves the result.■

Lemma 26.20.

Suppose the inner loop turns forever and τ k is frozen from some counter \(k_{0}\) onwards. Then 0 ∈ ∂ [ɛ] f(x).

Proof.

Replacing f(x) by f a (x), the proof proceeds in exactly the same fashion as the proof of Lemma 26.11. We obtain 0 ∈ ∂ ϕ(x,x) and use Lemma 26.16 to conclude 0 ∈ [ɛ] f(x). ■

As we have seen, axiom (26.42) was necessary to deal with the case τ k in Lemma 26.19, while Lemma 26.20 gets by without this condition. Altogether, that means we have to adjust the stopping test in step 2 of the algorithm to \(0 \in \partial _{[\tilde{\varepsilon }]}f({x}^{j})\), where \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime \prime} +\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\). As in the case of exact function values, we may delegate the stopping test to the inner loop, so if the latter halts due to insufficient progress, we interpret this as \(0 \in \partial _{[\tilde{\varepsilon }]}f({x}^{j})\), which is the precision we can hope for. Section 26.8 below gives more details.

Let us now scan through the proof of Theorem 26.12 and see what changes occur through the use of inexact function values \(f_{a}({y}^{k})\), \(f_{a}({x}^{j})\).

Theorem 26.21.

Let x 1 be such that \(\Omega {^\prime} =\{ x \in {\mathbb{R}}^{n}: f(x) \leq f({x}^{1}) + 2\bar{\varepsilon }\}\) is bounded. Suppose f is ɛ′-convex on Ω′, that subgradients are drawn from ∂ [ɛ] f(y), and that inexact function values f a (y) satisfy \(\vert f(y) - f_{a}(y)\vert \leq \bar{\varepsilon }\) . Suppose axiom (26.42) is satisfied. Then every accumulation point \(\bar{x}\) of the sequence x j satisfies \(0 \in \partial _{[\tilde{\varepsilon }]}f(\bar{x})\) , where \(\tilde{\varepsilon }=\varepsilon +(\varepsilon {^\prime \prime} +\varepsilon {^\prime}+\varepsilon )/(\tilde{\gamma }-\gamma )\) .

Proof.

Notice that \(\tilde{\varepsilon }\) used in the stopping test has a different meaning than in Theorem 26.12. Replacing f(x j) by \(f_{a}({x}^{j})\) and \(f({y}^{k_{j}})\) by \(f_{a}({y}^{k_{j}})\), we follow the proof of Theorem 26.12. Part (i) is still valid with these changes. Notice that \(\Omega =\{ x: f_{a}(x) \leq f_{a}({x}^{1})\} \subset \Omega {^\prime}\) and Ω′ is bounded by hypothesis, so Ω is bounded.

As in the proof of Theorem 26.12 the set of all trial points \({y}^{1},\ldots,{y}^{k_{j}}\) visited during all the inner loops j is bounded. However, a major change occurs in part (ii). Observe that the accumulation point \(\bar{x}\) used in the proof of Theorem 26.12 is neither among the trial points nor the serious iterates. Therefore, \(f_{a}(\bar{x})\) is never called for in the algorithm. Now observe that the sequence \(f_{a}({x}^{j})\) is decreasing and by boundedness of Ω converges to a limit \(F_{a}(\bar{x})\). We redefine \(f_{a}(\bar{x}) = F_{a}(\bar{x})\), which is consistent with the condition \(\vert f_{a}(\bar{x}) - f(\bar{x})\vert \leq \bar{\varepsilon }\): indeed, \(f_{a}({x}^{j}) \geq f({x}^{j})-\bar{\varepsilon }\) gives \(F_{a}(\bar{x}) \geq f(\bar{x})-\bar{\varepsilon }\), and \(f_{a}({x}^{j}) \leq f({x}^{j})+\bar{\varepsilon }\) gives \(F_{a}(\bar{x}) \leq f(\bar{x})+\bar{\varepsilon }\).

The consequences of the redefinition of \(f_{a}(\bar{x})\) are that the upper envelope model ϕ is now jointly upper semicontinuous at \((\bar{x},\bar{x})\), and that the argument leading to formula (26.29) remains unchanged, because \(f_{a}({x}^{j}) \rightarrow \phi (\bar{x},\bar{x})\).

Let us now look at the longer argument carried out in parts (iii)–(ix) of the proof of Theorem 26.12, which deals with the case where \(\|g_{j}\| \geq \mu > 0\) for all j. Parts (iii)–(vii) are adapted without difficulty. Joint upper semicontinuity of ϕ at \((\bar{x} + h,\bar{x})\) is used at the end of (vii), and this is assured as a consequence of the redefinition \(f_{a}(\bar{x}) = F_{a}(\bar{x})\) of f a at \(\bar{x}\).

Let us next look at part (viii). In Theorem 26.12 we use ɛ′-convexity. Since the latter is in terms of exact values, we need axiom (26.42) for the sequence \({y}^{k_{j}-\nu _{j}} \rightarrow \bar{ x}\), similarly to the way it was used in Lemma 26.19. We have to check that despite the redefinition of f a at \(\bar{x}\) axiom (26.42) is still satisfied. To see this, observe that \({y}^{k_{j}-\nu _{j}}\) is a trial step which is rejected in the jth inner loop, so that its approximate function value is too large. In particular, \(f_{a}({y}^{k_{j}-\nu _{j}}) \geq f_{a}({x}^{j+1})\), because x j+1 is the first trial step accepted. This estimate shows that (26.42) is satisfied at \(\bar{x}\).

Using (26.42) we get the analogue of (26.36), which is

$$\displaystyle{f_{a}({y}^{k_{j}-\nu _{j} }) - M_{k_{j}-\nu _{j}}({y}^{k_{j}-\nu _{j} },{x}^{j}) \leq (\varepsilon {^\prime \prime} +\varepsilon {^\prime} +\varepsilon +\nu _{ j} +\delta _{j})\|{y}^{k_{j}-\nu _{j} } - {x}^{j}\|}$$

for certain ν j ,δ j → 0. Estimate (26.40) remains unchanged, so we can combine the two estimates to obtain the analogue of (26.41) in part (ix), which is

$$\displaystyle{\tilde{\rho }_{k_{j}-\nu _{j}} \leq \rho _{k_{j}-\nu _{j}} + \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime \prime} +\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}\eta }.}$$

Using the same argument as in the proof of Theorem 26.12, we deduce

$$\displaystyle{\eta \leq \frac{{(1+\zeta )}^{2}(\varepsilon {^\prime \prime} +\varepsilon {^\prime}+\varepsilon )} {{(1-\zeta )}^{2}(\tilde{\gamma }-\gamma )} }$$

for \(\eta = \text{dist}(0,\partial \phi (\bar{x},\bar{x}))\). Since 0 < ζ < 1 was arbitrary, we obtain \(\eta \leq \frac{\varepsilon {^\prime \prime}+\varepsilon {^\prime}+\varepsilon } {\tilde{\gamma }-\gamma }\). Now as \(\bar{x}\) is a point of continuity of f a , Corollary 26.18 tells us that \(\eta = \text{ dist}(0,\partial _{[\varepsilon ]}f(\bar{x}))\). Therefore \(0 \in \partial _{[\varepsilon +\eta ]}f(\bar{x})\). Since \(\varepsilon +\eta \leq \tilde{\varepsilon }\), we are done.■

26.8 Stopping

In this section we address the practical problem of stopping the algorithm. The idea is to use tests which are based on the convergence theory developed in the previous sections.

In order to save time, the stopping test in step 2 of the algorithm is usually delegated to the inner loop. This is based on Lemmas 26.9 and  26.11 and the following.

Lemma 26.22.

Suppose tangent program (26.7) has the solution y k = x. Then 0 ∈ ∂ [ɛ] f(x).

Proof.

From (26.8) we have \(0 \in \partial \phi _{k}(x,x) \subset \partial \phi (x,x) \subset \partial _{[\varepsilon ]}f(x)\) by Lemma 26.16.

In [20] we use the following two-stage stopping test. Fixing a tolerance level tol > 0, if x + is the serious step accepted by the inner loop at x, and if x + satisfies

$$\displaystyle{\frac{\|x - {x}^{+}\|} {1 +\| x\|} < \text{tol},}$$

then we stop the outer loop and accept x + as the solution, the justification being Lemma 26.22. On the other hand, if the inner loop at x fails to find x + and either exceeds a maximum number of allowed inner iterations or provides three consecutive trial steps y k satisfying

$$\displaystyle{\frac{\|x - {y}^{k}\|} {1 +\| x\|} < \text{tol},}$$

then we stop the inner loop and the algorithm and accept x as the final solution. Here the justification comes from Lemmas 26.9 and 26.11.
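
A minimal Python sketch of this two-stage test follows. The function names and the way the inner loop reports its trial steps are assumptions; the formulas are exactly the two relative tests displayed above.

```python
import numpy as np

def outer_stop(x, x_plus, tol):
    """First stage: stop the outer loop when the accepted serious step x_plus
    is within relative tolerance of x (justification: Lemma 26.22)."""
    return np.linalg.norm(x - x_plus) / (1.0 + np.linalg.norm(x)) < tol

def inner_stop(x, trial_steps, tol, max_inner_iters):
    """Second stage: stop the inner loop and the algorithm when no serious step
    is found and either the iteration budget is exceeded or three consecutive
    trial steps y^k are within tolerance of x (Lemmas 26.9 and 26.11)."""
    if len(trial_steps) > max_inner_iters:
        return True
    small = [np.linalg.norm(x - y) / (1.0 + np.linalg.norm(x)) < tol
             for y in trial_steps]
    # look for three consecutive small trial steps
    return any(small[i] and small[i + 1] and small[i + 2]
               for i in range(len(small) - 2))
```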

Remark 26.23.

An interesting aspect of inexactness theory with unknown precisions ɛ,ɛ′,ɛ″ is the following pair of scenarios, which may require different handling. The first is when functions and subgradients are inexact or noisy, but we do not take this into account and proceed as if information were exact. The second scenario is when we deliberately use inexact information in order to gain speed or deal with problems of very large size. In the first case we typically arrange all elements of the algorithm like in the exact case, including situations where we are not even aware that information is inexact. In the second case we might introduce new elements which make the most of the fact that data are inexact.

As an example of the latter, in [30] where f is convex, the author does not use downshift with respect to f a (x), and as a consequence one may have \(\phi _{k}(x,x) > f_{a}(x)\), so that the tangent program (26.7) may fail to find a predicted descent step y k at x. The author then uses a sub-loop of the inner loop, where the τ-parameter is decreased until either a predicted descent step is found or optimality within the allowed tolerance of function values is established.

26.9 Example from Control

Optimizing the \(H_{\infty }\)-norm [4, 7, 19, 20] is a typical application of (26.1) where inexact function and subgradient evaluations may arise. The objective function is of the form

$$\displaystyle\begin{array}{rcl} f(x) =\max _{\omega \in \mathbb{R}}\overline{\sigma }\left (G(x,j\omega )\right ),& &{}\end{array}$$
(26.43)

where \(G(x,s) = C(x){\left (sI - A(x)\right )}^{-1}B(x) + D(x)\) is defined on the open set \(S =\{ x \in {\mathbb{R}}^{n}: A(x)\mbox{ stable}\}\) and where A(x), B(x), C(x), D(x) are matrix-valued mappings depending smoothly on \(x \in {\mathbb{R}}^{n}\). In other words, for \(x \in S\) each \(G(x,\cdot )\) is a stable real-rational transfer matrix.
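
For illustration only, (26.43) can be approximated by sampling the frequency axis and taking the largest singular value of G(x,jω) over the grid. This naive sketch (the state-space data A, B, C, D at the current x and the grid are assumptions) only returns a lower bound on the norm and is not the quadratically convergent method discussed below.

```python
import numpy as np

def hinf_grid_value(A, B, C, D, omegas):
    """Approximate f(x) = sup_omega sigma_max(G(x, j*omega)) on a frequency grid.
    A, B, C, D are the state-space matrices of G(x, .) at the current x.
    Returns the grid maximum (a lower bound for the true value) and the
    frequency attaining it."""
    n = A.shape[0]
    best_val, best_omega = -np.inf, None
    for w in omegas:
        G = C @ np.linalg.solve(1j * w * np.eye(n) - A, B) + D
        val = np.linalg.svd(G, compute_uv=False)[0]   # largest singular value
        if val > best_val:
            best_val, best_omega = val, w
    return best_val, best_omega
```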

Notice that f is a composite function of the form \(f =\| \cdot \|_{\infty }\circ \mathcal{G}\), where \(\|\cdot \|_{\infty }\) is the \(H_{\infty }\)-norm, which turns the Hardy space \(\mathcal{H}_{\infty }\) of functions G which are analytic and bounded in the open right-half plane [53, p. 100] into a Banach space,

$$\displaystyle{\|G\|_{\infty } =\sup _{\omega \in \mathbb{R}}\overline{\sigma }\left (G(j\omega )\right ),}$$

and \(\mathcal{G}: S \rightarrow \mathcal{H}_{\infty }\), \(x\mapsto G(x,\cdot ) = C(x){(\cdot I - A(x))}^{-1}B(x) + D(x) \in \mathcal{H}_{\infty }\) is a smooth mapping, defined on the open subset \(S =\{ x \in {\mathbb{R}}^{n}: A(x)\mbox{ stable}\}\). Since composite functions of this form are lower C 2, and therefore also lower C 1, we are in business. For the convenience of the reader we also include a more direct argument proving the same result:

Lemma 26.24.

Let f be defined by (26.43) , then f is lower C 2 , and therefore also lower C 1 , on the open set \(S =\{ x \in {\mathbb{R}}^{n}: A(x)\mbox{ stable}\}\) .

Proof.

Recall that \(\overline{\sigma }(G) =\max _{\|u\|=1}\max _{\|v\|=1}\text{Re\,}u\,G{v}^{H}\), so that

$$\displaystyle{f(x) =\max _{\omega \in {\mathbb{S}}^{1}}\max _{\|u\|=1}\max _{\|v\|=1}\text{Re\,}u\,G(x,j\omega ){v}^{H}.}$$

    Here, for \(x \in S\), the stability of G(x, ⋅ ) assures that G(x,s) is analytic in s on a band \(\mathcal{B}\) on the Riemann sphere \(\mathbb{C} \cup \{\infty \}\) containing the zero meridian \(j{\mathbb{S}}^{1}\) with \({\mathbb{S}}^{1} =\{\omega:\omega \in \mathbb{R} \cup \{\infty \}\}\), a compact set homeomorphic to the real 1-sphere. This shows that f is lower C 2 on the open set S. Indeed, \((x,\omega,u,v)\mapsto F(x,\omega,u,v)\,:=\,\text{Re\,}u\,G(x,j\omega ){v}^{H}\) is jointly continuous on \(S \times {\mathbb{S}}^{1} \times {\mathbb{C}}^{m} \times {\mathbb{C}}^{p}\) and smooth in x, and \(f(x) =\max _{(\omega,u,v)\in K}F(x,\omega,u,v)\) for the compact set \(K = {\mathbb{S}}^{1} \times \{ u \in {\mathbb{C}}^{m}:\| u\| = 1\} \times \{ v \in {\mathbb{C}}^{p}:\| v\| = 1\}\). ■

The evaluation of f(x) is based on the iterative bisection method of Boyd et al. [10]. Efficient implementations use Boyd and Balakrishnan [11] or Bruinsma and Steinbuch [12] and guarantee quadratic convergence. All these approaches are based on the Hamiltonian test from [10], which states that f(x) >γ if and only if the Hamiltonian

$$\displaystyle\begin{array}{rcl} & & H(x,\gamma ) = \left [\begin{array}{cc} A(x)& 0 \\ 0 & - A{(x)}^{\top } \end{array} \right ] -\left [\begin{array}{cc} 0 &B(x) \\ C{(x)}^{\top }&0 \end{array} \right ]{\left [\begin{array}{cc} \gamma I &D(x) \\ D{(x)}^{\top }& \gamma I \end{array} \right ]}^{-1}\left [\begin{array}{cc} C(x)& 0 \\ 0 & - B{(x)}^{\top } \end{array} \right ]{}\end{array}$$
(26.44)

has purely imaginary eigenvalues. The bundle method of [7], which uses (26.44) to compute function values, can now be modified to use approximate values \(f_{a}({y}^{k})\) for unsuccessful trial points y k. Namely, if the trial step y k is to become the new serious iterate x +, its value f(y k) has to be below f(x). Therefore, as soon as the Hamiltonian test (26.44) certifies f(y k) > f(x) even before the exact value f(y k) is known, we may dispense with the exact computation of f(y k). We may stop the Hamiltonian algorithm at the stage where the first γ with f(y k) > γ ≥ f(x) occurs, compute the intervals where \(\omega \mapsto \overline{\sigma }\left (G({y}^{k},j\omega )\right )\) is above γ, take the midpoints of these intervals, say \(\omega _{1},\ldots,\omega _{r}\), and pick the one where the frequency curve is maximum. If this is ω ν , then \(f_{a}({y}^{k}) = \overline{\sigma }\left (G({y}^{k},j\omega _{\nu })\right )\). The approximate subgradient g a is computed via the formulas of [4] with ω ν replacing an active frequency. This procedure is trivially consistent with (26.42), because f(x) = f a (x) and f a (y) ≤ f(y).
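
The Hamiltonian test itself is easy to sketch: for a given level γ one assembles H(x,γ) from (26.44) and checks for eigenvalues on the imaginary axis. The following Python sketch assumes γ is larger than the largest singular value of D(x), so that the middle block in (26.44) is invertible, and uses a crude numerical tolerance for "purely imaginary"; the certified level search of [10, 11, 12] is of course more careful.

```python
import numpy as np

def hamiltonian(A, B, C, D, gamma):
    """Assemble H(x, gamma) from (26.44) for the state-space data of G(x, .)."""
    n, m = B.shape
    p = C.shape[0]
    left   = np.block([[A, np.zeros((n, n))], [np.zeros((n, n)), -A.T]])
    outer1 = np.block([[np.zeros((n, p)), B], [C.T, np.zeros((n, m))]])
    middle = np.block([[gamma * np.eye(p), D], [D.T, gamma * np.eye(m)]])
    outer2 = np.block([[C, np.zeros((p, n))], [np.zeros((m, n)), -B.T]])
    return left - outer1 @ np.linalg.solve(middle, outer2)

def level_exceeded(A, B, C, D, gamma, tol=1e-8):
    """Certify f(x) > gamma if H(x, gamma) has an eigenvalue on the imaginary axis."""
    eigs = np.linalg.eigvals(hamiltonian(A, B, C, D, gamma))
    return bool(np.any(np.abs(eigs.real) < tol * max(1.0, np.abs(eigs).max())))
```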

If we wish to allow inexact values not only at trial points y but also at serious iterates x, we can use the termination tolerance of the Hamiltonian algorithm [11]. The algorithm works with estimates \(f_{l}(x) \leq f(x) \leq f_{u}(x)\) and terminates when \(f_{u}(x) - f_{l}(x) \leq 2\eta _{x}F(x)\), returning \(f_{a}(x)\,:=\,\left (f_{l}(x) + f_{u}(x)\right )/2\), where we have the choice \(F(x) \in \{ f_{l}(x),f_{u}(x),f_{a}(x)\}\). Then \(\vert f(x) - f_{a}(x)\vert \leq 2\eta _{x}\vert F(x)\vert \). As η x is under control, we can arrange that \(\eta _{x}\vert F(x)\vert \leq \eta _{y}\vert F(y)\vert + \text{o}(\|x - y\|)\) in order to assure condition (26.42).

Remark 26.25.

The outlined method applies in various other cases in feedback control where function evaluations use iterative procedures, which one may stop short to save time. We mention IQC-theory [2], which uses complex Hamiltonians, [7] for related semi-infinite problems, or the multidisk problem [3], where several \(H_{\infty }\)-criteria are combined in a progress function. The idea could be used quite naturally in the ɛ-subgradient approaches [36, 37] or in search methods like [1].