1 Introduction

The optimal control of variational inequalities is a natural extension of PDE-constrained optimization in which the forward problem or underlying PDE is replaced by a variational inequality or convex variational problem. There are a vast number of applications in which variational inequalities or convex variational problems are used. Perhaps the most well-known disciplines are continuum mechanics [32, 52] and mathematical image processing [67, 70, 72]. However, one also finds variational inequalities in such diverse areas as petroleum engineering [85], digital microfluidics [84], and mathematical finance [51].

Just as in PDE-constrained optimization, we are interested in optimizing or controlling the solution to variational inequalities through one or several parameters. For example, in a packaging problem, we might be interested in determining the optimal force distribution to apply to an elastic membrane in order to obtain a desired shape without passing through a fixed obstacle. In the petroleum engineering application mentioned above, one obtains a variational inequality when using the variational formulation for the steady laminar flow of drill mud modelled as a Bingham fluid. Here, one might wish to determine the optimal pressure needed to effect a desired flow regime.

Due to a lack of Fréchet (or even Gâteaux) differentiability of the control-to-state mapping, a reduced space optimization approach will require either concepts from variational analysis or approximation theory not only to prove existence of a solution but also to derive meaningful first-order optimality conditions and develop efficient numerical methods.

The theoretical and algorithmic challenges remain intact when considering a full-space approach, since the feasible set (after reformulating the variational inequality as a complementarity problem) is both non-convex and fails to satisfy the usual constraint qualifications. For example, in PDE-constrained optimization, one typically uses the Slater or Robinson-Zowe-Kurcyusz constraint qualifications, cf. [48, 77], to ensure the existence and boundedness of the set of Lagrange multipliers. In order to handle these degenerate constraints, one may take a penalty or relaxation approach, though recent work suggests that there are viable theoretical approaches to this “full-space” setting that do not rely on smoothing, see [81].

We now give a brief historical development. We caution the reader that this is only a sampling of the sizeable amount of work published over the past five decades. The purpose is merely to provide an understanding of the current motivations in theory and numerical methods, especially the ideas presented in this paper. Variational inequalities were first introduced by Signorini and solved by Fichera [29]. For this article, the early works by Brezis, Lions, and Stampacchia [18, 60] are perhaps the most relevant as they provide us with a sufficient existence and regularity theory. As a starting point for further study, we also suggest the well-known monographs by Kinderlehrer and Stampacchia and Rodrigues [53, 71]. Perhaps the earliest work on the optimal control of variational inequalities is due to Yvon, in which the adaptive penalty technique for variational inequalities (see Section 4.1) was used in order to approximate the control problem by a more tractable, parameter-dependent problem [86, 87]. This was later used by Lions [58, 59] and fully developed by Barbu [7]. The monograph by Barbu [7] contains many important results and techniques that are still used today, see also related contributions in [9,10,11,12, 50, 68, 73]. In 1976, Mignot offered an alternative to the smoothing approach by developing concepts of generalized differentiability [63]. There, he was able to prove and obtain an explicit formula for the Hadamard directional differentiability of the control-to-state mapping of the variational inequality and, with this result, derive first-order optimality conditions. This result later appeared in [64] and is rederived in [45] using techniques from variational analysis. More recent work in the control of variational inequalities has focused on the development of efficient, function-space-based numerical methods and ever more complex settings as in [2, 4, 21,22,23, 34,35,36, 39, 40, 45, 49, 56, 65, 66, 80, 83].

Parallel to the infinite-dimensional developments, a great deal of progress has been made on theory and numerical methods for finite-dimensional mathematical programs with equilibrium or complementarity constraints, as evidenced by the well-known manuscripts by Luo, Pang, and Ralph [61] and Outrata, Kočvara, and Zowe [69]. In addition to the references therein, we also mention the approaches in [3, 74], which have since been extended to infinite dimensions and are in part featured in this paper.

In reference to the existing finite-dimensional literature, recent works have begun referring to the problem class considered here as elliptic mathematical programs with equilibrium constraints (or elliptic MPECs). We will henceforth do the same for the sake of brevity.

The article is structured as follows. After introducing some necessary notation, data assumptions, and the canonical example in Section 2, we discuss sufficient conditions for the existence of a solution in Section 3. Afterwards, we give several kinds of multiplier-based first-order stationarity conditions similar to classical Karush-Kuhn-Tucker (KKT) conditions. These may or may not be first-order optimality conditions depending on the regularity and data of the given problem. To a newcomer to elliptic MPECs, this array of possible stationarity conditions may seem puzzling. However, it is necessary to understand the gap between what is theoretically the “best” type of KKT point and what type of stationary point a given numerical method can theoretically guarantee (at least asymptotically).

Following the theoretical results in Section 3, we pass to the main focus of our discussion: numerical optimization methods for elliptic MPECs. First, in Section 4, we consider what we refer to as “regularization-based” methods. In Section 4.1, we present a typical adaptive smoothing method that makes use of approximation theory for variational inequalities and which is directly linked to the derivation of first-order stationarity conditions for the elliptic MPEC. Similarly, in Section 4.2, we use a simple penalization of the complementarity condition (in weak form) to obtain a simpler PDE-constrained optimization problem with control and state constraints. The numerical solution of the subproblems in the adaptive penalty method is thoroughly discussed for a canonical example problem and the numerical behavior is illustrated. We note that the smoothed problems become increasingly difficult to solve as the smoothing parameter tends to zero, which in part motivates the desire for “non-smooth” methods.

The structure of the non-smooth numerical methods section is devised to illustrate the parallels to some popular methods for smooth PDE-constrained optimization problems, e.g., projected-gradient methods (Section 5.1), direct solvers for the KKT system (Section 5.2), and globalization of the direct solvers via a line search (Section 5.3). There are obvious (and expected) limitations to these approaches, all of which arise from the non-smooth or degenerate nature of elliptic MPECs. Nevertheless, they indicate the possibility of inventing new efficient numerical methods beyond the adaptive smoothing/penalty paradigm, which often outperform the smooth methods in practice.

In Section 5.1, we present a new subgradient method for the reduced space problem. The choice of “subgradient” is motivated by the limiting variational calculus found, e.g., in the book by Mordukhovich [66]. We discuss the asymptotic behavior of this method and illustrate the potential for future study. Following this in Section 5.2, an active-set-based solver with feasibility restoration suggested by Hintermüller in [36] as a direct solver for the C-stationarity system is considered. Finally, in Section 5.3, we consider the so-called bundle-free implicit programming approach [46], which can be understood as a globalization of the active-set solver using a line search. We extend the latter to include control constraints. The performance of the non-smooth methods is then compared and contrasted with that of the adaptive penalty method. All numerical examples have been solved in the Julia programming language [14] version 0.5.0 on a 2016 MacBook Pro Intel(R) Core(TM) i5 @ 3.1 GHz with 16 GB 2133 MHz LPDDR3 RAM.

2 Notation, Assumptions, and Preliminary Results

In this section, we fix our notation and analytical framework. Elliptic variational inequalities and elliptic MPECs are introduced.

2.1 Norms, Inner Products, and Convergence

All spaces are based on the real number field. The Euclidean norm and scalar product on \(\mathbb {R}^m\) are denoted by |x| and x ⋅ y for \(x,y \in \mathbb {R}^m\), respectively. For \(r \in \mathbb {R}\), we denote \(\max (0,r)\) by (r)+. All other norms are denoted by ∥⋅∥X for some space X. The topological dual of X is denoted by X∗ and the dual pairing by \(\langle \cdot , \cdot \rangle _{X,X^*}\). The inner product on a Hilbert space H is denoted by (⋅, ⋅)H. Given a sequence {xk}⊂ X, we denote strong convergence to \(\overline {x} \in X\) by \(x_k \stackrel {X}{\to } \overline {x}\); for weak convergence, we use \(x_k \stackrel {X}{\rightharpoonup } \overline {x}\) and for weak-star convergence, \(x_k \stackrel {X}{\rightharpoonup ^*} \overline {x}\). In all these cases, we drop the sub- or superscripts if it is clear in context.

2.2 Extended Real-Valued Functionals and Convex Analysis

For a topological vector space U, the extended real-valued functional \(G:U \to \overline {\mathbb {R}}\) is proper if G(u) > −∞ for all u ∈ U and G(w) < +∞ for some w ∈ U. G is said to be closed or lower-semicontinuous if its epigraph

$$\displaystyle \begin{aligned} \mathrm{epi}\, G:= \left\{ (u,\alpha) \in U \times \mathbb{R} \left|\; G(u) \le \alpha \right.\right\} \end{aligned}$$

is a closed subset in the product topology on \(U \times \mathbb {R}\). Moreover, G is convex provided

$$\displaystyle \begin{aligned} G(\lambda u + (1-\lambda) w) \le \lambda G(u) + (1-\lambda)G(w),\quad \forall u,w \in U, \forall \lambda \in [0,1]. \end{aligned}$$

For a Banach space V and \(F:V \to \overline {\mathbb {R}}\), the (convex) subdifferential of F at some point x ∈ V such that −∞ < F(x) < +∞ is defined by the potentially empty set:

$$\displaystyle \begin{aligned} \partial F(x) := \left\{x^* \in V^* \left|\; F(y) \ge F(x) + \langle x^*,y-x\rangle,\forall y \in V \right.\right\}. \end{aligned}$$

If C ⊂ V , then the indicator functional iC of C is defined by

$$\displaystyle \begin{aligned} i_{C}(x) = 0 \text{ if } x \in C,\; \text{ and } i_{C}(x) = +\infty \text{ otherwise.} \end{aligned}$$

If C is non-empty, closed, and convex, then \(\mathcal {N}_C(x) := \partial i_{C}(x)\) is called the convex normal cone to C at x ∈ C. If K ⊂ V  is a non-empty, closed, convex cone, then

$$\displaystyle \begin{aligned} \xi \in \mathcal{N}_K(x)\ \Leftrightarrow\ x \in K, \xi \in K^{-} := \left\{\mu \in V^* \left| \langle \mu,y \rangle \le 0,\forall y \in K \right.\right\} \text{ and }\langle \xi, x \rangle = 0. \end{aligned}$$

Recall that K− is referred to as the polar cone to K. For more on convex analysis, see, e.g., [25].
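As a simple finite-dimensional illustration of these objects, take \(K = \mathbb{R}^m_+ \subset \mathbb{R}^m\). Then

$$\displaystyle \begin{aligned} K^{-} = \mathbb{R}^m_{-} \quad \text{ and } \quad \mathcal{N}_K(x) = \left\{ \xi \in \mathbb{R}^m \left|\; \xi \le 0,\; \xi \cdot x = 0 \right.\right\},\quad x \in K, \end{aligned}$$

i.e., ξi ≤ 0 with ξi = 0 whenever xi > 0. This componentwise complementarity structure is exactly what reappears, in function space, in the elliptic MPECs below.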

2.3 Function Spaces and 2-Capacity

Unless otherwise noted, \(\varOmega \subset \mathbb {R}^n\), \(n \in \left \{1,2,3\right \}\), is a non-empty, bounded, and open subset with Lipschitz boundary Γ = ∂Ω. The Lebesgue measure of \(E \subset \mathbb {R}^n\) is denoted by meas(E). For p ∈ ℝ with 1 ≤ p < +∞, we denote the usual Lebesgue space of p-integrable functions by Lp(Ω) and the Lebesgue space of essentially bounded functions by L∞(Ω). The norms are given by

$$\displaystyle \begin{aligned} \| u \|{}_{L^p} = \left(\int_{\varOmega} |u(\omega) |{}^p d\omega\right)^{1/p} \text{ and } \| u \|{}_{L^{\infty}} = \text{ess sup}_{\omega \in \varOmega}| u(\omega) |, \end{aligned}$$

respectively. For \(k \in \mathbb {N}\), we denote the Sobolev space of Lp-functions u with \(| D^{\alpha }u |= |(\partial ^{\alpha _1}u,\dots ,\partial ^{\alpha _n}u)| \in L^p(\varOmega )\) by Wk, p(Ω), where α = (α1, …, αn) is a multi-index with |α1| + ⋯ + |αn|≤ k and \(\partial ^{\alpha _i}u\) is the αi-th weak partial derivative of u with respect to xi, i ∈{1, 2, 3}. The Wk, p-norms are then defined by

$$\displaystyle \begin{aligned} \| u \|{}_{W^{k,p}} = \left(\sum_{|\alpha|\le k}\int_{\varOmega} |D^{\alpha} u(\omega) |{}^p d\omega\right)^{1/p} \text{ and } \| u \|{}_{W^{k,\infty}} = \sum_{|\alpha|\le k}\text{ess}\ \text{sup}_{\omega \in \varOmega}| D^{\alpha} u(\omega) |.\end{aligned} $$

For our discussion, the most important case is when p = 2. Here, both L2(Ω) and Hk(Ω) := Wk, 2(Ω) are Hilbert spaces with inner product defined using the norms given above. We denote the space of all H1-functions with zero trace by \(H^1_0(\varOmega )\) and its dual by H−1(Ω). It follows from the Poincaré inequality that

$$\displaystyle \begin{aligned} \| u \|{}_{H^1_0} = \|\nabla u\|{}_{L^2} = (\nabla u,\nabla u)_{L^2}^{1/2} \end{aligned}$$

is an equivalent norm on \(H^1_0(\varOmega )\). See, e.g., [1] for more. Finally, recall that the 2-capacity of an arbitrary subset E ⊂ Ω is given by

$$\displaystyle \begin{aligned} \mathrm{Cap}_2(E,\varOmega) := \inf\left\{\| v \|{}^2_{H^1_0}: v \in H^1_0(\varOmega),\; v \ge 1 \text{ a.e. on an open neighborhood }G \supset E \right\}, \end{aligned}$$

cf. [6, Prop. 5.8.3]. Note that \(H^1_0\)-functions possess a quasi-continuous representative, i.e., one that is continuous outside open sets of arbitrarily small capacity, cf. [26]. Moreover, there exist E ⊂ Ω with Cap2(E, Ω) > 0 and meas(E) = 0, e.g., the boundary of a smooth open set. Hence, the quasi-continuous representative of an \(H^1_0\)-function in 2D is meaningfully determined on a smooth curve in the plane (a set of positive capacity but zero measure), but not at single points (which have zero capacity).

2.4 Elliptic Variational Inequalities

Let H be a Hilbert space and H∗ its topological dual. For this subsection, (⋅, ⋅) denotes the inner product and ∥⋅∥ the norm on H. The pairing between H and H∗ is denoted by 〈⋅, ⋅〉. Let \(a: H \times H \to \mathbb {R}\) be a bilinear form on H and A : H → H∗ the associated bounded linear operator, i.e., a(u, v) = 〈Au, v〉, u, v ∈ H. We recall that a is said to be coercive/elliptic if there exists some constant c > 0 such that a(v, v) ≥ c∥v∥2 for all v ∈ H. Clearly, the function ∥v∥a := (a(v, v))1∕2 defines an equivalent norm on H. Let K ⊂ H be non-empty, closed, and convex and w ∈ H∗. Then, the variational problem

$$\displaystyle \begin{aligned} \text{Find } u \in K: a(u,v-u) \ge \langle w, v - u \rangle,\quad \text{ for all } v \in K \end{aligned} $$
(1)

is called an elliptic variational inequality. Note that problems of this type are often referred to as variational inequalities of the first kind. They can be equivalently written as a generalized equation:

$$\displaystyle \begin{aligned} \text{Find } u \in K: Au + \mathcal{N}_K(u) = Au + \partial i_K(u) \ni w. \end{aligned} $$
(2)

If we replace iK by a subdifferentiable proper closed convex functional φ, then we obtain a variational inequality of the second kind. The following is due to Lions and Stampacchia [60], see also [53]:

Theorem 1

Let \(a:H \times H \to \mathbb {R}\) be a coercive bilinear form, K ⊂ H closed and convex, and w ∈ H∗. Then, (1) possesses a unique solution S(w). Moreover, the solution mapping w ↦ S(w) is Lipschitz continuous: For all w1, w2 ∈ H∗, it holds that

$$\displaystyle \begin{aligned} \|S(w_1) - S(w_2)\|{}_{H} \le (1/c)\|w_1 - w_2\|{}_{H^*}, \end{aligned}$$

where c is the constant of coercivity of a.
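The Lipschitz estimate follows from a short standard argument, which we recall only for convenience: testing (1) for w1 with v = S(w2), testing it for w2 with v = S(w1), adding the two inequalities, and invoking coercivity gives

$$\displaystyle \begin{aligned} c\|S(w_1) - S(w_2)\|^2 \le a(S(w_1) - S(w_2),\, S(w_1) - S(w_2)) \le \langle w_1 - w_2, S(w_1) - S(w_2)\rangle \le \|w_1 - w_2\|_{H^*}\, \|S(w_1) - S(w_2)\|. \end{aligned}$$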

Although the solution mapping S is Lipschitz continuous, it is not necessarily differentiable. In some special cases, S is directionally differentiable. For more on variational inequalities, see, e.g., [53]. The conditions on A and K will be considered standing assumptions on the variational inequality for the rest of this article. We conclude this subsection with an example.

Example 1

In the context of (1), let \(H:= H^1_0(\varOmega )\) and H = H−1(Ω). Moreover, define Ψ ∈ H1(Ω) with Ψ|Γ ≤ 0 and \(K \subset H^1_0(\varOmega )\) such that

$$\displaystyle \begin{aligned} K := \left\{ u \in H^1_0(\varOmega) \left|\; u(x) \ge \varPsi(x),\; \text{ for almost every (a.e.) } x \in \varOmega \right.\right\}. \end{aligned}$$

Clearly, since \(\max (0,\varPsi ) \in H^1_0(\varOmega )\), K≠∅. Moreover, one readily shows that K is closed and convex in \(H^1_0(\varOmega )\). The setting of (1) is general enough to allow for nonsymmetric bilinear forms a(u, v), e.g.,

$$\displaystyle \begin{aligned} a(u,v) := \int_{\varOmega} \sum_{i,j} a_{ij}\partial_i u \partial_j v - \sum_{i} b_i (\partial_i u) v + c u v \mathrm{d}x,\quad u,v \in H^1_0(\varOmega), \end{aligned}$$

with appropriate assumptions on aij, bi, c. However, for our purposes it suffices to consider

$$\displaystyle \begin{aligned} a(u,v) := \int_{\varOmega} \nabla u \cdot \nabla v \mathrm{d}x,\quad u,v \in H^1_0(\varOmega) \end{aligned}$$

Letting f ∈ H−1(Ω) and combining the above, we obtain a classical obstacle problem

$$\displaystyle \begin{aligned} \text{Find } u \in K: \int_{\varOmega} \nabla u\cdot \nabla [v-u] \mathrm{d}x \ge \langle f, v - u \rangle,\quad \text{ for all } v \in K.\end{aligned} $$

If Ψ|Γ ≡ 0, then (without altering the boundary conditions), we obtain an equivalent problem:

$$\displaystyle \begin{aligned} \text{Find } u \in K_0: \int_{\varOmega} \nabla u\cdot \nabla [v-u] \mathrm{d}x \ge \langle f + \varDelta \varPsi, v - u \rangle,\quad \text{ for all } v \in K_0, \end{aligned}$$

where \(K_0 = \left \{ u \in H^1_0(\varOmega ) \left |\; u(x) \ge 0,\; \text{ a.e. } x \in \varOmega \right .\right \};\) see, e.g., [53, 71].
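To make Example 1 concrete, the following minimal Julia sketch (illustrative only; the data f ≡ −10, Ψ ≡ −0.1, and the mesh are hypothetical choices, not taken from the text) discretizes the one-dimensional obstacle problem by finite differences and solves it with a projected Gauss–Seidel iteration, i.e., each unknown is relaxed and then projected back onto the constraint u ≥ Ψ.

```julia
# 1D obstacle problem: find u >= psi with -u'' >= f, u(0) = u(1) = 0, and
# complementarity between u - psi and -u'' - f. Discretized by central
# differences and solved with projected Gauss-Seidel (illustrative sketch).
function obstacle_pgs(n::Int; maxit::Int = 100_000, tol::Float64 = 1e-10)
    h   = 1.0 / (n + 1)
    f   = fill(-10.0, n)      # hypothetical constant load pushing the membrane down
    psi = fill(-0.1, n)       # hypothetical constant obstacle (psi <= 0 on the boundary)
    u   = zeros(n)            # zero initial guess, feasible since 0 >= psi
    d   = 2.0 / h^2           # diagonal of the -u'' stencil
    o   = -1.0 / h^2          # off-diagonal of the -u'' stencil
    for k in 1:maxit
        delta = 0.0
        for i in 1:n
            s = f[i]
            i > 1 && (s -= o * u[i-1])
            i < n && (s -= o * u[i+1])
            unew  = max(psi[i], s / d)          # Gauss-Seidel step, then projection onto K
            delta = max(delta, abs(unew - u[i]))
            u[i]  = unew
        end
        delta < tol && return u, k
    end
    return u, maxit
end

u, sweeps = obstacle_pgs(100)
println("projected Gauss-Seidel: $sweeps sweeps, min(u) = $(minimum(u))")
```

Projected Gauss–Seidel is simple and monotone but converges slowly on fine meshes; it serves here only to illustrate the structure of (1).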

Of course, this set K represents perhaps the easiest kind of set that one can consider in a variational inequality. It is just one example of a so-called “polyhedric set,” a notion introduced in the mid-1970s by Haraux [33] and Mignot [63] in the context of sensitivity analysis of variational inequalities. See the recent detailed study [82] for a state-of-the-art on polyhedricity. For more general closed, convex sets, e.g.,

$$\displaystyle \begin{aligned} \left\{ u \in H^1_0(\varOmega) \left|\; | (\nabla u)(x) | \le \varPsi(x),\; \text{a.e. } x \in \varOmega \right.\right\} \end{aligned} $$
(3)

it is possible to prove existence, uniqueness, and continuity results under weak assumptions on Ψ. Since these sets are neither polyhedric nor cones, the differential sensitivity analysis of the solution map and subsequent derivation of optimality conditions or efficient numerical methods is extremely challenging. Elliptic MPECs with this type of constraint were considered in [45]. However, we caution the reader that the proofs for the derivation of the tangent cones are in fact erroneous. Thus, the differential sensitivity results for the solution mapping only hold under the assumption that the tangent cones do in fact have the form purported in the text.

2.5 Elliptic Mathematical Programs with Equilibrium Constraints

We present an abstract framework for a class of elliptic MPECs. Let V , H, and Z be separable Hilbert spaces such that the state space V  is a dense subset of H and V ⊂ H ⊂ V∗ both algebraically and topologically. Moreover, assume that f ∈ V∗, Zad ⊂ Z is a non-empty, closed, and convex set (the set of admissible controls/decision variables), B : Z → V∗ is a bounded linear operator, and \(F:V \to \overline {\mathbb {R}}\) and \(G: Z \to \overline {\mathbb {R}}\). We define an elliptic MPEC as follows:

$$\displaystyle \begin{aligned} \min\ & J(z,u) := F(u) + G(z) \text{ over } (z,u) \in Z \times V, \end{aligned} $$
(4a)
$$\displaystyle \begin{aligned} \text{s.t. } &z \in Z_{\mathrm{ad}},\; u \text{ solves }~(1) \text{ with } w := Bz + f. \end{aligned} $$
(4b)

In light of Theorem 1, we can rewrite (4) in reduced form, analogously to standard PDE-constrained optimization problems:

$$\displaystyle \begin{aligned} \min\ \mathcal{J}(z) := F(S(Bz)) + G(z) \text{ over } z\in Z_{\mathrm{ad}}. \end{aligned} $$
(5)

Here, f ∈ V∗ is fixed. Therefore, we only write S(Bz) for the solution mapping (instead of S(Bz + f)). Obtaining a meaningful full-space formulation of the optimization problem (4) is not always possible. However, if K in (1) is a cone, then we can also formulate a full-space version of the elliptic MPEC by introducing a slack variable ξ ∈ V∗:

$$\displaystyle \begin{aligned} \min\ &J(z,u) \text{ over } (z,u,\xi) \in Z \times V \times V^*, \end{aligned} $$
(6a)
$$\displaystyle \begin{aligned} \text{s.t. } &z \in Z_{\mathrm{ad}},{} \end{aligned} $$
(6b)
$$\displaystyle \begin{aligned} &Au - \xi = Bz + f,{} \end{aligned} $$
(6c)
$$\displaystyle \begin{aligned} &u \in K,\; \xi \in K^{+},\; \langle \xi, u \rangle = 0.{} \end{aligned} $$
(6d)

Here, K+ := −K− is the dual cone of K. This follows from (2). Due to the presence of the complementarity condition (6d), problems of the type (6) are often referred to in the finite-dimensional literature (and more recently in the infinite-dimensional literature) as (elliptic) mathematical programs with complementarity constraints (MPCCs); see e.g., [61, 69, 74] (finite dimensions) and the recent work [81] (infinite dimensions).

Remark 1

We consider here the simplest case in which the control or decision variable z appears only on the “right-hand side” of the variational inequality. However, in many interesting applications, e.g., topology optimization [8], the control enters nonlinearly through the differential operator A, e.g., when A(z)u = −div(z∇u). In the context of elliptic MPECs, this case is scarcely covered in the literature, see [36].

We finalize this section with a canonical example of an elliptic MPEC.

Example 2

In the notation of (4), we set \(V = H^1_0(\varOmega )\), H = L2(Ω), V∗ = H−1(Ω), and Z = L2(Ω). For some ud ∈ L2(Ω), we let F be a standard tracking-type functional and G an L2-Tikhonov regularization:

$$\displaystyle \begin{aligned} J(z,u) := F(u) {+} G(z) = \frac{1}{2}\|u {-} u_d\|{}^2_{L^2} {+} \frac{\alpha}{2}\|z\|{}^2_{L^2},\quad (z,u) \in L^2(\varOmega) \times H^1_0(\varOmega),\, \alpha > 0.\end{aligned} $$

For the forward problem, we choose the simplest form of the obstacle problem with Ψ ≡ 0. Then, the following is an elliptic MPEC:

$$\displaystyle \begin{aligned} \min\ &\frac{1}{2}\|u - u_d\|{}^2_{L^2} + \frac{\alpha}{2}\|z\|{}^2_{L^2},\text{ over } (z,u) \in L^2(\varOmega) \times H^1_0(\varOmega), \end{aligned} $$
(7a)
$$\displaystyle \begin{aligned} \text{s.t. } &z \in Z_{\mathrm{ad}},\; u \in K_0 \text{ solves }: \int_{\varOmega} \nabla u\cdot \nabla [v-u] \mathrm{d}x \ge \langle Bz + f, v - u \rangle,\quad \text{ for all } v \in K_0,\end{aligned} $$
(7b)

Some possibilities for Zad are local bilateral constraints: − 1 ≤ z(x) ≤ 1 for a.e. x ∈ Ω, or global constraints such as \( \|z\|{ }_{L^2} \le 1 \).
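Both choices of Zad admit cheap projections, which are used repeatedly below (cf. (18) and (23)). A minimal sketch in Julia, assuming the control has already been discretized as a vector; the quadrature weight h for the discrete L2-norm is a hypothetical placeholder.

```julia
# Projection onto the bilateral set {-1 <= z <= 1 a.e.}: componentwise clamping.
proj_box(z; a = -1.0, b = 1.0) = clamp.(z, a, b)

# Projection onto the ball {||z||_{L^2} <= 1}: radial scaling.
# h is a hypothetical quadrature weight (e.g., the mesh-cell volume).
function proj_ball(z; h = 1.0)
    nrm = sqrt(h * sum(abs2, z))
    return nrm <= 1.0 ? z : z ./ nrm
end
```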

3 Existence and Stationarity Conditions

We start this section in the abstract framework of Section 2.5. In order to provide insight into the deeper meaning of the various stationarity conditions, we ultimately restrict ourselves to the canonical example (Example 2).

3.1 Existence of a Solution

Under suitable conditions on F, G, Zad, and B, we can use Weierstrass's existence theorem, see, e.g., [6, Thm. 3.2.5], to prove that the reduced-form MPEC has a solution. For the following result, we appeal to the monograph by Barbu [7, Chap. 3].

Theorem 2

In the context of problem (4), we assume the following:

  1. \(F:H \to \mathbb {R}\) is locally Lipschitz and nonnegative.

  2. \(G:Z \to \overline {\mathbb {R}}\) is convex, lower-semicontinuous, and there exist constants κ1 > 0 and \(\kappa _ 2 \in \mathbb {R}\) such that G(z) ≥ κ1∥z∥Z + κ2 for all z ∈ Z.

  3. B is completely continuous.

Then, (4), or equivalently (5), admits a solution.

Proof

By replacing G(z) with \(G(z) + i_{Z_{ad}}(z)\), the assertion follows from [7, Prop. 3.1]. In particular, we note that by Barbu [7, Lem. 3.1] the mapping (S ∘ B) : Z → V  is completely continuous and, furthermore, the composite objective functional \((F \circ S \circ B):Z \to \mathbb {R}_+\) is weakly lower-semicontinuous.

For any z ∈ Zad, (1) has a unique solution u = S(Bz), and ξ := Au − Bz − f is then the unique element of V∗ with −ξ ∈ \(\mathcal {N}_K(u)\). Conversely, if (z, u, ξ) satisfies (6b)–(6d), then z ∈ Zad and u = S(Bz). In other words, there is a one-to-one correspondence between the feasible sets of (5) and (6). Therefore, if (4), or equivalently (5), admits a solution, then so does (6).

Corollary 1

Under the assumptions of Theorem 2 , (6) admits a solution.

Remark 2

Setting Theorem 2 aside, we could also prove this assertion by appealing directly to [6, Thm. 3.2.5]. Indeed, under the hypotheses of Theorem 2, (6) can be viewed as an unconstrained problem in which the associated objective functional is radially unbounded, proper, and weakly lower-semicontinuous.

3.2 Primal Stationarity Conditions

In this subsection, we restrict ourselves to the setting of Example 2. As mentioned earlier, there are some situations in which the control-to-state mapping S is directionally differentiable. In particular, Mignot [63] demonstrated that the control-to-state mapping in Example 2 is directionally differentiable in the sense of Hadamard. The next result draws the parallel to PDE-constrained optimization problems.

Theorem 3

Let \((\overline {z},\overline {u})\) be a solution to (7), then the following first-order necessary optimality condition holds:

$$\displaystyle \begin{aligned} \alpha (\overline{z}, w - \overline{z}) + (\overline{u}-u_d, S'(\overline{u}; B(w - \overline{z}))) \ge 0,\quad \forall w \in Z_{\mathrm{ad}}. \end{aligned} $$
(8)

In the finite-dimensional literature, e.g., in [61, 74], primal first-order conditions of the type (8) are often referred to as B-stationarity conditions. There, the “B” comes from the fact that either the so-called Bouligand tangent cone or Bouligand differentiability is used. This motivates the next definition.

Definition 1

If \((\overline {z},\overline {u})\) is a feasible point of (7) that satisfies (8), then \((\overline {z},\overline {u})\) is B-stationary.

For the setting in (5), it is possible to derive B-stationarity conditions provided that F, G are smooth. Clearly, the main challenge is determining directional differentiability of S. Suppose for the sake of argument that \(S'(\overline {u}; du)\) is linear in du ∈ V∗. Then, we can “dualize” the B-stationarity conditions (8):

$$\displaystyle \begin{aligned} \alpha (\overline{z}, w - \overline{z}) + (B^*p, w - \overline{z}) \ge 0,\quad \forall w \in Z_{\mathrm{ad}},\quad p = S'(\overline{u})^*(\overline{u}-u_d). \end{aligned}$$

Note that the condition “\(p = S'(\overline {u})^*(\overline{u}-u_d)\)” essentially means (in this ideal case) that p solves a type of adjoint equation. Using this “adjoint state” p, we could now develop solution algorithms. However, \(S'(\overline {u};du)\) is in general nonlinear (albeit Lipschitz) in du and thus, the standard adjoint state p does not exist. As a result, there are several different types of dual/multiplier-based first-order stationarity conditions reminiscent of classical KKT conditions in nonlinear programming.
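For orientation, we recall the form of this directional derivative for the canonical obstacle problem as established in [63, 64] (stated here without the technical details involving capacity): at u = S(w), the derivative d = S′(w; h) in a direction h is the unique solution of a variational inequality posed on the critical cone,

$$\displaystyle \begin{aligned} \text{Find } d \in \mathcal{K}(u):\; a(d, v - d) \ge \langle h, v - d \rangle,\quad \text{ for all } v \in \mathcal{K}(u),\qquad \mathcal{K}(u) := \left\{ v \in \mathcal{T}_K(u) \left|\; \langle Au - w, v \rangle = 0 \right.\right\}, \end{aligned}$$

where \(\mathcal{T}_K(u)\) denotes the tangent cone to K at u (understood in the sense of capacity). In particular, h ↦ S′(w; h) is positively homogeneous and Lipschitz, but it is linear only if \(\mathcal{K}(u)\) is a linear subspace, which (up to capacity) corresponds to the absence of a biactive set.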

3.3 Dual Stationarity Conditions

In this subsection, we again restrict ourselves to the setting in Example 2. Furthermore, we assume that Γ is regular enough to guarantee that \(S(Bz) \in H^2(\varOmega ) \cap H^1_0(\varOmega )\) for any z ∈ Zad. This is possible if, e.g., Bz + f ∈ L2(Ω) and Ω is a convex polyhedron, cf. [18, 71]. This regularity assumption ensures that ξ ∈ L2(Ω) and, by the Sobolev embedding theorem, S(Bz) is at least Hölder continuous on Ω. We now introduce two multiplier-based first-order stationarity conditions taken from [39, 40].

Definition 2

Let z ∈ Zad and u = S(Bz) with associated slack variable ξ ∈ L2(Ω). The set \(\varOmega ^0 := \left \{x \in \varOmega \left |\; u(x) = 0 \right .\right \}\) is called the active set or coincidence set and \(I := \left \{x \in \varOmega \left |\; u(x) > 0 \right .\right \}\) the inactive set. The set \(\varOmega ^{00} := \varOmega ^0 \cap \left \{x \in \varOmega \left |\; \xi (x) = 0\right .\right \}\) is called the biactive or weakly active set and \(\varOmega ^{0+} := \varOmega ^0 \cap \left \{x \in \varOmega \left |\; \xi (x) > 0\right .\right \}\) the strongly active set.

Note that the biactive and strongly active sets are only defined up to a set of Lebesgue measure zero. In contrast, the active and inactive sets may be defined more finely, namely up to sets of capacity zero, even in less regular settings.

Definition 3

The point \((z,u,\xi ) \in L^2(\varOmega ) \times H^1_0(\varOmega ) \times L^2(\varOmega )\) is called a C-stationary point for (7) provided that there exist \(p \in H^1_0(\varOmega )\) and λ ∈ H−1(Ω) such that the following system of equations is fulfilled:

$$\displaystyle \begin{aligned} \alpha (z, w - z) + (B^*p, w - z) &\ge 0, \quad \forall w \in Z_{\mathrm{ad}} \end{aligned} $$
(9a)
$$\displaystyle \begin{aligned} A^*p - \lambda &= u_d - u \end{aligned} $$
(9b)
$$\displaystyle \begin{aligned} Au - \xi &= Bz + f \end{aligned} $$
(9c)
$$\displaystyle \begin{aligned} \xi &\ge 0,\text{ a.e. } x \in \varOmega {} \end{aligned} $$
(9d)
$$\displaystyle \begin{aligned} u &\ge 0,\text{ a.e. } x \in \varOmega {} \end{aligned} $$
(9e)
$$\displaystyle \begin{aligned} (\xi,u) &= 0 {} \end{aligned} $$
(9f)
$$\displaystyle \begin{aligned} \langle \lambda, p\rangle &\le 0 {} \end{aligned} $$
(9g)
$$\displaystyle \begin{aligned} p &= 0,\text{ a.e. } x \in \varOmega^{0+} {} \end{aligned} $$
(9h)
$$\displaystyle \begin{aligned} \langle \lambda, \phi \rangle &= 0, \forall \phi \in H^1_0(\varOmega): \phi = 0,\text{ a.e. } \varOmega^0 {} \end{aligned} $$
(9i)

If, in addition to (9), p and λ satisfy:

$$\displaystyle \begin{aligned} p &\le 0,\text{ a.e. } x \in \varOmega^{00} {} \end{aligned} $$
(10a)
$$\displaystyle \begin{aligned} \langle \lambda, \phi \rangle &\ge 0, \forall \phi \in H^1_0(\varOmega): \phi \ge 0,\text{ a.e. } \varOmega^{00}, \phi = 0,\text{ a.e. } \varOmega \setminus (I \cup \varOmega^{00}), {} \end{aligned} $$
(10b)

then \((z,u,\xi ) \in L^2(\varOmega ) \times H^1_0(\varOmega ) \times L^2(\varOmega )\) is called an S-stationary point.

Several comments on this system are necessary. To start, we note that if K0 is replaced by the entire space \(H^1_0(\varOmega )\), then ξ, λ and the conditions (9d)–(10b) vanish and we obtain the usual first-order system for a linear-quadratic PDE-constrained optimization problem with convex control constraints. Moreover, if there is no biactive set, then both (in fact all) dual stationarity concepts coincide. Thus, biactivity is essentially the source of all difficulties in the study of MPECs. We will see later that it also relates to the Gâteaux differentiability of the control-to-state map.

Perhaps the most critical point here is the usage of almost everywhere versus quasi everywhere conditions for the multipliers. As defined above, these S-stationarity conditions are not equivalent to the necessary first-order optimality conditions in the pioneering works [63, 64] unless certain regularity assumptions on the active/biactive/inactive sets are made. In [63, 64] (see also [43, 80]), the notion of capacity (in contrast to Lebesgue measure) is used. This is because capacity is needed for the correct representation of the tangent/contingent cone and its polar. When replacing q.e. by a.e. conditions, some conditions become stronger and others weaker.

In order to see this, suppose that meas(Ω0) = 0, but Cap2(Ω0, Ω) > 0. Then, (9i) would imply λ = 0 ∈ H−1(Ω). However, when using capacity, since the test functions ϕ admit quasi-continuous representatives, requiring ϕ = 0 “quasi-everywhere” (q.e.) on Ω0 would shrink the set of test functions so that λ is not necessarily zero. Similar problems arise by requiring a.e. instead of q.e. conditions on the adjoint variables p. Indeed, if Ω0+ has measure zero, then (9h) is trivially satisfied by any p; this would not be the case if we use q.e. since \(p \in H^1_0(\varOmega )\) admits a quasi-continuous representative.

Therefore, the stationarity conditions of Mignot and Puel, see (29) below, should be considered the true strong/S-stationarity conditions for our canonical MPEC as opposed to (9)–(10). For more complex constraints in which the image space of the constraint mapping is an Lp-space, as in, e.g.,[34, 35, 45], the problems with defining active sets up to sets of capacity zero seem to be absent.

Nevertheless, (9)–(10) arose naturally through the limiting process of an adaptive penalty scheme. As such, they are highly relevant for the study of numerical methods for elliptic MPECs, especially for methods utilizing smoothing plus continuation. In contrast, it is unclear how to properly include notions of capacity into efficient numerical methods (without simplifying assumptions as in [46]).

Finally, in many complex real-world applications, it might be impossible to obtain even C-stationarity conditions of this type, due to a lack of compactness and regularity properties. For example, in the optimal control of electrowetting on dielectrics [4], Allen-Cahn [27, 28], Cahn-Hilliard [20], or Cahn-Hilliard-Navier-Stokes system with obstacle potentials [47], one can usually only derive an approximation of (9g). Here, this would be equivalent to replacing (9g) by

$$\displaystyle \begin{aligned} \limsup_{k\to \infty} \langle \lambda_k, p_k\rangle \le 0 \end{aligned}$$

for sequences \(\left \{\lambda _k\right \}\) and \(\left \{p_k\right \}\) with \(\lambda _k \rightharpoonup \lambda \) and \(p_k \rightharpoonup p\).

In conclusion, the various notions of dual stationarity and the theory needed to derive them are still active areas of research. The main point here is that there is a stratification of concepts that one should also be aware of when considering the design and convergence of numerical methods. That is, it is not enough to prove that a scheme converges but also to what kind of stationary point. Of course, many of the issues involving capacity, weak topologies, products of weakly convergent sequences, and weak lim-inf or lim-sup-type sign conditions will not necessarily appear in numerical experiments.

4 Regularization-Based Methods

In this and the coming sections, we present a number of numerical optimization methods for elliptic MPECs. These are split into two classes: Methods that employ adaptive smoothing, relaxation, or penalization of the forward problem or complementarity constraints (regularization-based methods) and those that do not (non-smooth methods).

We also note that the proper discretization of elliptic MPECs using (adaptive) finite element schemes that take into account the additional difficulties due to the inherent degeneracy/non-smoothness has only been considered in a handful of papers. For example, we mention the most recent works [16, 17, 24, 37, 62]. For our numerical study, we make use of a simple finite difference scheme to discretize the operators along with a nested grid approach to simulate mesh refinements. This allows us to easily compare the methods.

The regularization-based methods all follow a similar scheme. First, the variational inequality is approximated by a parameter-dependent semilinear elliptic PDE or the complementarity constraint is relaxed or penalized. This yields in both cases a more tractable family of approximating optimization problems. Next, the smooth PDE-constrained problems are solved, yielding a parameter-dependent KKT point. Finally, continuation is performed on the regularization parameter (passing to 0 or +∞).

The non-smooth methods are rather different and there is still plenty of room for new ideas. We only mention here that with these methods the emphasis is placed on directly solving the original, non-smooth problem without changing the forward problem. We postpone further details until later.

4.1 An Adaptive Penalty Method

We begin in the abstract framework of Section 2.5 and present a general approximation result as found in [7, Thm. 2.2]. For a comprehensive study on the numerical analysis and approximation of variational inequalities, see, e.g., [31]. In the following, ϕ𝜖 refers to some penalty functional for the constraint K.

Definition 4

For any constant 𝜖 > 0, let \(\phi ^{\epsilon }: V \to \mathbb {R}\) be convex and Fréchet differentiable on V and satisfy:

  1. There exists a constant C, independent of 𝜖 and u, with ϕ𝜖(u) ≥ −C(∥u∥V + 1) for all 𝜖 > 0 and u ∈ V.

  2. ϕ𝜖(u) → iK(u) as 𝜖 ↓ 0 for all u ∈ V.

  3. For all u ∈ V and all {u𝜖} such that \(u_{\epsilon } \rightharpoonup u\) as 𝜖 ↓ 0, liminf𝜖↓0ϕ𝜖(u𝜖) ≥ iK(u).

Remark 3

See also [7, Thm 2.4] for functionals on H, which is more relevant for (13).

In the abstract setting, we approximate (1) by

$$\displaystyle \begin{aligned} Au + \nabla \phi^{\epsilon}(u) = Bz + f \end{aligned} $$
(11)

We denote the approximate solution mapping by S𝜖. For the next result, we restrict ourselves to the case when A is symmetric (as in Example 2), see [7, Thm. 2.2].

Theorem 4

Let w𝜖 → w in V∗ as 𝜖 ↓ 0. Then the sequence {u𝜖}⊂ V  with u𝜖 := S𝜖(w𝜖) converges weakly to u = S(w) as 𝜖 ↓ 0. If, in addition,

$$\displaystyle \begin{aligned} \langle \nabla \phi^{\epsilon}(y) - \nabla \phi^{\delta}(v),y-v\rangle \ge -C(1 + (\| \nabla \phi^{\epsilon}(y) \|{}^2_{V^*} + \| \nabla \phi^{\delta}(v) \|{}^2_{V^*}))(\epsilon + \delta), \\ \forall \epsilon, \delta > 0, \forall y,v \in V, \end{aligned} $$
(12)

then u 𝜖 → u (strongly in V ) as 𝜖 ↓ 0.

In the setting of Example 2, one possibility for ϕ𝜖 is:

$$\displaystyle \begin{aligned} \phi^{\epsilon}(u) := (2\epsilon)^{-1}\|(-u)_+\|{}^2_{L^2}. \end{aligned} $$
(13)
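The following elementary computation, with K0 viewed as a subset of L2(Ω), underlies the Moreau–Yosida interpretation of (13) mentioned next:

$$\displaystyle \begin{aligned} \min_{v \in K_0}\, \frac{1}{2\epsilon}\|u - v\|^2_{L^2} = \frac{1}{2\epsilon}\|u - (u)_+\|^2_{L^2} = \frac{1}{2\epsilon}\|(-u)_+\|^2_{L^2} = \phi^{\epsilon}(u), \end{aligned}$$

since the minimizer is the pointwise projection v = (u)+.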

Redefining K0 for L2-functions, ϕ𝜖 in (13) is the Moreau-Yosida regularization of the indicator functional \(i_{K_0}\) with respect to the L2-topology, cf. [5, Sec. 2.7]. However, in order to solve the smoothed MPECs numerically, we will require more smoothness of ϕ𝜖 below. We now approximate (5) by:

$$\displaystyle \begin{aligned} \min\ &\mathcal{J}_{\epsilon}(z) := F(S_{\epsilon}(Bz)) + G(z) \text{ over } z\in Z_{\mathrm{ad}},{} \end{aligned} $$
(14)

Thus, the existence theory, optimality conditions, and numerical methods for (14) reduce to results from PDE-constrained optimization.

Turning to Example 2, we employ a smoothed plus function in (11) characterized by:

$$\displaystyle \begin{aligned} \nabla \phi^{\epsilon}(u) = \left\{\begin{array}{ll} \epsilon^{-1}u-0.5,& \text{ if } u \ge \epsilon,\\ \frac{u^3}{\epsilon^3} - \frac{u^4}{2\epsilon^4},& \text{ if } u \in (0,\epsilon),\\ 0,& \text{ else,} \end{array}\right. \quad \nabla^2 \phi^{\epsilon}(u) = \left\{\begin{array}{ll} \epsilon^{-1},& \text{ if } u \ge \epsilon,\\ \frac{3u^2}{\epsilon^3} - \frac{2u^3}{\epsilon^4},& \text{ if } u \in (0,\epsilon),\\ 0.& \text{ else.} \end{array}\right. \end{aligned} $$
(15)
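For later reference, (15) transcribes directly into code. The following pointwise sketch (the names dphi and d2phi are ours, not from the text) is what enters the residual and Newton matrix below; the slack in (16d) is then ξ𝜖 = ∇ϕ𝜖(−u𝜖), obtained by broadcasting.

```julia
# Pointwise evaluation of the smoothed penalty derivatives in (15).
function dphi(u::Float64, ϵ::Float64)
    u >= ϵ   && return u / ϵ - 0.5
    u >  0.0 && return u^3 / ϵ^3 - u^4 / (2 * ϵ^4)
    return 0.0
end

function d2phi(u::Float64, ϵ::Float64)
    u >= ϵ   && return 1.0 / ϵ
    u >  0.0 && return 3 * u^2 / ϵ^3 - 2 * u^3 / ϵ^4
    return 0.0
end

# e.g., for a nodal vector u and fixed ϵ:  ξ = dphi.(-u, ϵ)
```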

We can then prove that the approximate solution mapping S𝜖 is Fréchet differentiable. Then, for a solution (z𝜖, u𝜖), there exists an adjoint state p𝜖 ∈ V  such that

$$\displaystyle \begin{aligned} \alpha (z_{\epsilon}, w - z_{\epsilon}) + (B^*p_{\epsilon}, w - z_{\epsilon}) &\ge 0, \quad \forall w \in Z_{\mathrm{ad}}{} \end{aligned} $$
(16a)
$$\displaystyle \begin{aligned} A^*p_{\epsilon} - \lambda_{\epsilon} &= u_d - u_{\epsilon},{} \end{aligned} $$
(16b)
$$\displaystyle \begin{aligned} Au_{\epsilon} - \xi_{\epsilon} &= Bz_{\epsilon} + f,{} \end{aligned} $$
(16c)
$$\displaystyle \begin{aligned} \xi_{\epsilon} &= \nabla \phi^{\epsilon}(-u_{\epsilon}),\text{ a.e. } x \in \varOmega {} \end{aligned} $$
(16d)
$$\displaystyle \begin{aligned} \lambda_{\epsilon} &= -\nabla^2 \phi^{\epsilon}(-u_{\epsilon})p_{\epsilon},\text{ a.e. } x \in \varOmega, {} \end{aligned} $$
(16e)

holds; cf. the techniques in [57, 77].

We note here that the case of gradient constraints mentioned in (3) can also be treated using such a penalty method. However, using a standard quadratic penalty/Moreau-Yosida-type approach yields a penalized state equation of the form:

$$\displaystyle \begin{aligned} -\mathrm{div}((1 + \xi_{\epsilon}) \nabla u_{\epsilon}) &= Bz_{\epsilon} + f, \end{aligned} $$
(17a)
$$\displaystyle \begin{aligned} \xi_{\epsilon} &= \frac{1}{\epsilon}\left(1 - \frac{\varPsi}{|\nabla u_{\epsilon}|} \right)_+, \end{aligned} $$
(17b)

which is considerably more challenging due to the bilinear dependence on u𝜖 in the PDE and accompanying non-smooth first-order PDE.

From (16), it is possible to derive C-stationarity conditions and, under further assumptions, even S-stationarity conditions, cf. [39, 40], by passing to the limit in 𝜖. More specifically, we show that a subsequence of stationary points for (16) converges to a point that satisfies C-stationarity for (7). Therefore, if we have a numerical method that solves (16), then by performing continuation on 𝜖 0 we have a means of numerically approximating C-stationary points (or better) for (7). This furnishes a convergence proof in function space for the “outer loop” of the method, depicted in Algorithm 4.1. For the interested reader, we provide a short discussion of the limiting arguments in Detail 5.

Algorithm 4.1 Adaptive penalty method: outer loop

Detail 5 (Sketch of the Limiting Technique)

We first note that (16a) is equivalent to

$$\displaystyle \begin{aligned} z_{\epsilon} = \mathrm{Proj}_{Z_{\mathrm{ad}}}(-\frac{1}{\alpha} B^*p_{\epsilon}).\end{aligned} $$
(18)

Suppose Zad is bounded; then {z𝜖}𝜖>0 is bounded in Z. Let 𝜖k ↓ 0. Then, there exists a subsequence {zl} with \(z_{l} = z_{\epsilon _{k_l}} \stackrel {Z}{\rightharpoonup } z^{\star }\) ∈ Zad. Hence, \(S_{\epsilon _{k_l}}(Bz_{l}) =: u_{l} \stackrel {V}{\to } u^{\star } = S(Bz^{\star })\) by Theorem 4 due to the complete continuity of B. Therefore, \(\xi _{l} \to \xi ^{\star } := Au^{\star } - Bz^{\star } - f \in -\mathcal {N}_{K_0}(u^{\star })\), i.e., u⋆ and ξ⋆ satisfy the state equation and complementarity conditions (9c)–(9f). For the adjoint state p𝜖 and multiplier λ𝜖, we test the adjoint equation (16b) with p𝜖:

$$\displaystyle \begin{aligned}\displaystyle c_1\|p_{\epsilon}\|{}^2_{V} \le \langle A^*p_{\epsilon} - \lambda_{\epsilon},p_{\epsilon}\rangle = (u_d - u_{\epsilon},p_{\epsilon}) \le\\\displaystyle \|u_d - u_{\epsilon}\|{}_{H}\| p_{\epsilon}\|{}_{H} \le c_2 \|u_d - u_{\epsilon}\|{}_{H}\| p_{\epsilon}\|{}_{V} \end{aligned} $$

for scalars c1, c2 > 0, independent of 𝜖. Since \(u_{l} \stackrel {V}{\to } u^{\star }\), {pl} is bounded in V . Hence, there exists \(\{p_{l_m}\}\) with \(p_{l_{m}}\stackrel {V}{\rightharpoonup }p^{\star }\) and, by substitution, \(\lambda _{l_m} \stackrel {V^*}{\rightharpoonup } \lambda ^{\star } = u^{\star } -u_d +A^*p^{\star }\).

Without knowledge of the best possible system, one might stop at this point; however, the theory indicates that a much more refined system is possible. We quickly demonstrate (9g). The remaining conditions (9h), (9i) require lengthy arguments that go beyond the scope of this work. Returning to (16b), it is clear that

$$\displaystyle \begin{aligned} \langle A^*p_{\epsilon},p_{\epsilon}\rangle - (u_d - u_{\epsilon},p_{\epsilon}) = \langle\lambda_{\epsilon},p_{\epsilon}\rangle = -\int_{\varOmega} \nabla^2 \phi^{\epsilon}(-u_{\epsilon})| p_{\epsilon} |{}^2 \mathrm{d}x \le 0. \end{aligned}$$

Since A is symmetric and coercive, the quadratic form v ↦ a(v, v) is convex and continuous, hence weakly lower-semicontinuous. Moreover, due to the Rellich-Kondrachov theorem, both \(u_{\epsilon } \stackrel {H}{\to } u^{\star }\), \(p_{\epsilon } \stackrel {H}{\to }p^{\star }\). Thus,

$$\displaystyle \begin{aligned} \langle \lambda^{\star},p^{\star}\rangle = \langle A^*p^{\star},p^{\star}\rangle - (u_d - u^{\star},p^{\star}) \le \liminf_{m\to +\infty} \langle A^*p_{l_m},p_{l_m}\rangle - (u_d - u_{l_m},p_{l_m}) \le 0.\end{aligned} $$
(19)

In [7], a more general setting allows a similar argument. However in general, e.g., if A is not symmetric, we only have:

$$\displaystyle \begin{aligned} \limsup_{\epsilon \downarrow 0 } \langle\lambda_{\epsilon},p_{\epsilon}\rangle \le 0 \end{aligned}$$

as mentioned at the end of Section 3.3. □

A possible stopping rule in Algorithm 4.1 could be based on the residual of the C-stationarity system (9) or on reaching a minimum value of 𝜖k+1, e.g., machine precision. Obviously, the most computationally demanding part here is step 3. If \(\mathrm {Proj}_{Z_{\mathrm {ad}}}\) is relatively simple to calculate, e.g., for local bilateral constraints, then (16) reduces to

$$\displaystyle \begin{aligned} A^*p_{\epsilon} +\nabla^2 \phi^{\epsilon}(-u_{\epsilon})p_{\epsilon} &= u_d - u_{\epsilon}, \end{aligned} $$
(20a)
$$\displaystyle \begin{aligned} Au_{\epsilon} - \nabla \phi^{\epsilon}(-u_{\epsilon}) &= B\mathrm{Proj}_{Z_{\mathrm{ad}}}(-\frac{1}{\alpha} B^*p_{\epsilon}) + f. \end{aligned} $$
(20b)

One could then solve (20) using, e.g., a semismooth Newton method. Alternatively, an interior point approach for the projection might be employed. We briefly recall the semismooth Newton method in infinite-dimensional spaces as discussed in [19, 38, 79]. Let X, Y  be Banach spaces, D ⊂ X an open subset of X, and \(\mathcal {F}:D \to Y\).

Definition 5

The mapping \(\mathcal {F}:D \subset X \to Y\) is said to be Newton-differentiable on the open subset U ⊂ D, if there exists a family of mappings \(\mathcal {G}:U \to \mathcal {L}(X,Y)\) such that

$$\displaystyle \begin{aligned} \|\mathcal{F}(x+h) - \mathcal{F}(x) - \mathcal{G}(x+h)h\|{}_{Y} = o(\|h\|{}_{X}), \forall x \in U. \end{aligned}$$

\(\mathcal {G}\) is called the Newton derivative for \(\mathcal {F}\) on U. In [38], it is shown that

$$\displaystyle \begin{aligned} \mathcal{G}_{\delta}(y)(x) = \left\{\begin{array}{ccl} 1 & \text{if} & y(x) > 0\\ 0 & \text{if} & y(x) < 0\\ \delta & \text{if} & y(x) = 0 \end{array}\right. \end{aligned} $$
(21)

for every y ∈ X and \(\delta \in \mathbb {R}\) is a Newton derivative of \(\max (0,\cdot )\), under the condition that \(\max (0,\cdot ):L^{p}(\varOmega ) \to L^{q}(\varOmega )\) with 1 ≤ q < p ≤ ∞.

Therefore, if a mapping \(\mathcal {F}\) is Newton-differentiable, then using the concepts above leads to a generalized Newton step for the equation \(\mathcal {F}(x) = 0\), see, e.g., [19, 38].

Theorem 6

Suppose that \(\mathcal {F}(x^*) = 0\) and that \(\mathcal {F}\) is Newton-differentiable on an open neighborhood U of x∗ with Newton derivative \(\mathcal {G}\). If \(\mathcal {G}(x)\) is nonsingular for all x ∈ U and the set \(\left \{||\mathcal {G}(x)^{-1}||{ }_{\mathcal {L}(Y,X)} : x \in U\right \}\) is bounded, then the semismooth Newton iteration

$$\displaystyle \begin{aligned} x_{l+1} = x_l - \mathcal{G}(x_l)^{-1}\mathcal{F}(x_l),\quad l=0,1,2,\ldots \end{aligned} $$
(22)

converges superlinearly to x∗, provided that ||x0 − x∗||X is sufficiently small.
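Before specializing to (20), the mechanics of (21)–(22) can be illustrated by a small, self-contained finite-dimensional example (purely illustrative and not the scheme derived below): we solve F(x) = Ax + max(0, x) − b = 0 by the semismooth Newton iteration, choosing δ = 0 in (21).

```julia
using LinearAlgebra   # for norm

# Semismooth Newton iteration (22) for F(x) = A*x + max.(x, 0) - b in R^n.
# The Newton derivative of max(0,·) is 1 where x > 0 and 0 elsewhere (δ = 0 in (21)).
function ssn(A::Matrix{Float64}, b::Vector{Float64}; maxit = 50, tol = 1e-12)
    x = zeros(length(b))
    for l in 1:maxit
        F = A * x + max.(x, 0.0) - b
        norm(F) < tol && return x, l
        J = copy(A)
        for i in eachindex(x)
            x[i] > 0.0 && (J[i, i] += 1.0)   # add the Newton derivative of max(0,·)
        end
        x -= J \ F
    end
    return x, maxit
end

A = [2.0 -1.0; -1.0 2.0]                     # hypothetical SPD data
b = [1.0, -3.0]
x, its = ssn(A, b)
println("semismooth Newton: $its steps, residual = $(norm(A*x + max.(x, 0.0) - b))")
```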

Under the assumption that \(Z_{\mathrm {ad}} := \left \{v \in L^2(\varOmega ) \left |\; a \le v \le b, \text{ a.e. } x \in \varOmega \right .\right \}\) with a, b ∈ L2(Ω) and a < b, we can derive a semismooth Newton step for the solution of (20) in function space: Fix some \((u,p) \in H^1_0(\varOmega ) \times H^1_0(\varOmega )\) and define the following subsets of Ω:

$$\displaystyle \begin{aligned} \varOmega^a &:= \left\{ x \in \varOmega \left| a(x) + \alpha^{-1}(B^*p)(x) > 0\right.\right\},\\ \varOmega^b &:= \left\{ x \in \varOmega \left| -\alpha^{-1}(B^*p)(x)- b(x) > 0\right.\right\}. \end{aligned} $$

Moreover, let Ωina := Ω ∖ (Ωa ∪ Ωb) (up to a set of Lebesgue measure zero) and define the residual \(\mathcal {F}_{\epsilon }(u,p)\) of (20) by:

$$\displaystyle \begin{aligned} \mathcal{F}^1_{\epsilon}(u,p) &:= Au - \nabla \phi^{\epsilon}(-u) - B\mathrm{Proj}_{Z_{\mathrm{ad}}}(-\frac{1}{\alpha} B^*p) - f,\\ \mathcal{F}^2_{\epsilon}(u,p) &:= A^*p +\nabla^2 \phi^{\epsilon}(-u)p - u_d + u.\end{aligned} $$

Since

$$\displaystyle \begin{aligned} z = \mathrm{Proj}_{Z_{\mathrm{ad}}}(-\alpha^{-1}B^*p) = -\alpha^{-1} B^*p - (-\alpha^{-1}B^*p - b)_+ + (a+\alpha^{-1}B^*p)_+,\end{aligned} $$
(23)

we can use (21) to obtain a Newton derivative \(\mathcal {G}\) for \(\mathcal {F}\):

$$\displaystyle \begin{aligned} \mathcal{G}_{\epsilon}(u,p) = \left[ \begin{array}{cc} A + \nabla^2 \phi^{\epsilon}(-u) & \alpha^{-1}\chi_{\varOmega^{ina}}BB^*\\ I- \nabla^3 \phi^{\epsilon}(-u)p & A^* + \nabla^2 \phi^{\epsilon}(-u)\\ \end{array} \right], \end{aligned}$$

where \(\chi _{\varOmega ^{ina}}\) is the characteristic function for the set Ωina. If (δu, δp) denotes the difference between the new iterate and the current iterate in the semismooth Newton step, then at each iteration, we solve

$$\displaystyle \begin{aligned} \left[ \begin{array}{cc} A + \nabla^2 \phi^{\epsilon}(-u) & \alpha^{-1}\chi_{\varOmega^{ina}}BB^*\\ I- \nabla^3 \phi^{\epsilon}(-u)p & A^* + \nabla^2 \phi^{\epsilon}(-u)\\ \end{array} \right]\left[\begin{array}{c}\delta u\\ \delta p\end{array}\right] = -\mathcal{F}_{\epsilon}(u,p). \end{aligned} $$
(24)

If we can show that \(\mathcal {G}_{\epsilon }(u,p)\) is invertible independently of (u, p) (for fixed 𝜖 > 0) so that the set \(\left \{||\mathcal {G}_{\epsilon }(u,p)^{-1}||: (u,p) \in H^1_0(\varOmega )\times H^1_0(\varOmega ) \right \}\) is bounded, then we are guaranteed to have local superlinear convergence. This leads to Algorithm 4.2.

Algorithm 4.2 Adaptive penalty method: inner loop

We conclude this subsection with a numerical experiment. This is used in part to compare to the non-smooth numerical methods in later sections.

Example 3

Let Ω = (0, 1)², α = 1, b ≡ 0.035, a ≡ 0, and A = −Δ (associated with \(H^1_0(\varOmega )\)). Defining

$$\displaystyle \begin{aligned} y^{\dagger}({\mathbf{x}}_1,{\mathbf{x}}_2) = \left\{\begin{array}{ll} 160({\mathbf{x}}^3_1 - {\mathbf{x}}^2_1 + 0.25{\mathbf{x}}_1)({\mathbf{x}}_2^3 - {\mathbf{x}}^2_2 + 0.25{\mathbf{x}}_2) & \text{in } (0,0.5)\times (0,0.5),\\ 0 & \text{else,} \end{array}\right. \end{aligned}$$
$$\displaystyle \begin{aligned} \xi^{\dagger}({\mathbf{x}}_1,{\mathbf{x}}_2) = \max(0,-2|{\mathbf{x}}_1 - 0.8| - 2|{\mathbf{x}}_1{\mathbf{x}}_2 - 0.3| + 0.5), \end{aligned}$$

we set

$$\displaystyle \begin{aligned} f = -\varDelta y^{\dagger} - y^{\dagger} - \xi^{\dagger},\text{ and } u_d = y^{\dagger} + \xi^{\dagger} - \alpha\varDelta y^{\dagger}. \end{aligned}$$

The example is chosen due to the nontrivial biactive set and overlap of active sets for the control and state constraints. The only change to [40, Exp. 5.1] is the addition of control constraints, which were set to ±∞ there. Concerning the discretization and solution, we use a standard five-point stencil to discretize the negative Laplace operator with finite differences. The problem is solved on a uniform mesh with 512² grid points.

We start the algorithm at (z0, u0, ξ0) = (0, 0, 0). The stopping criterion is based on the L2-norm of the residual with a stopping tolerance of 10⁻⁹. For this example, the 𝜖-update in Algorithm 4.1 proved to be extremely sensitive, meaning that a reasonably aggressive update strategy, e.g., 𝜖k+1 = 𝜖k∕2, failed once \(\epsilon _{k} = \mathcal {O}(10^{-4})\). To be fair, one would normally not cold-start this algorithm on such a fine grid. Opting instead for either an adaptive FEM strategy, a multigrid scheme (as in [40]), or a nested grid strategy (as in [39, 46]) would certainly improve the performance and allow for a more aggressive update strategy for the smoothing parameter. Nevertheless, we see that as the penalized problem approaches the original non-smooth non-convex problem, the nonlinear system becomes increasingly difficult to solve (eventually even failing). See Table 1 for the convergence history and Figure 1 for plots of the solution.
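For completeness, the following Julia sketch shows one way to assemble the five-point-stencil discretization of A = −Δ with homogeneous Dirichlet conditions via Kronecker products; it is illustrative only and not the implementation used for the reported results, and N (the number of interior grid points per direction) is left generic.

```julia
using SparseArrays, LinearAlgebra

# Five-point-stencil discretization of -Δ on the unit square with homogeneous
# Dirichlet boundary conditions; N interior points per direction, lexicographic
# ordering of the unknowns. Illustrative sketch only.
function laplacian2d(N::Int)
    h  = 1.0 / (N + 1)
    T  = spdiagm(-1 => fill(-1.0, N - 1),
                  0 => fill( 2.0, N),
                  1 => fill(-1.0, N - 1))
    Id = sparse(1.0I, N, N)
    return (kron(Id, T) + kron(T, Id)) / h^2
end

A = laplacian2d(64)   # small illustrative size; the experiment in the text uses a much finer grid
```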

Fig. 1

Solution plots for adaptive penalty method Algorithms 4.1 and 4.2, clockwise from upper left: characteristic function \( \chi _{\varOmega ^{0}_{\star }}\), optimal control z, state u, multiplier λ, adjoint p, and multiplier ξ. Notice the nontrivial biactive set for the variational inequality and the upper and lower active sets for the control constraints

Table 1 Convergence history of the adaptive penalty method (Algorithm 4.1) with stopping tolerance tol = 10⁻⁹, starting value 𝜖1 = 1, and update 𝜖k = 2^{−(k−1)} with k = 1, …, 12; l denotes the number of iterations of the inner loop (Algorithm 4.2)

4.2 An ℓ1 Penalty Method

In this subsection, we present a technique originating in the finite-dimensional MPEC literature [3]. The extension to infinite dimensions can be found in [43]. The idea is elegant in its simplicity and allows us to approximate the elliptic MPEC by a sequence of PDE-constrained optimization problems with control and state constraints. Moreover, instead of a semilinear elliptic PDE, as in the previous method, we have a linear elliptic PDE. We begin in the abstract framework of Section 2.5 under the assumption that K is a cone and then pass to the problem in Example 2.

Using an ℓ1-penalty for the condition 〈u, ξ〉 = 0, we approximate (6) by

$$\displaystyle \begin{aligned} \min\ &J(z,u) + \frac{1}{\epsilon}|\langle u,\xi \rangle | \text{ over } (z,u,\xi) \in Z \times V \times V^*, \end{aligned} $$
(25a)
$$\displaystyle \begin{aligned} \text{s.t. } &z \in Z_{\mathrm{ad}},{} \end{aligned} $$
(25b)
$$\displaystyle \begin{aligned} &Au - \xi = Bz + f,{} \end{aligned} $$
(25c)
$$\displaystyle \begin{aligned} &u \in K,\; \xi \in K^{+}.{} \end{aligned} $$
(25d)

By definition of K+, 〈u, ξ〉≥ 0. Hence, for (u, ξ) ∈ K × K+, \(\frac {1}{\epsilon }|\langle u,\xi \rangle | =\frac {1}{\epsilon }\langle u,\xi \rangle \), which yields the smooth objective \(J_{\epsilon }(z,u,\xi ):=J(z,u) + \frac {1}{\epsilon }\langle u,\xi \rangle \).

The analysis for this problem requires several technical results. Nevertheless, under appropriate regularity and boundedness assumptions, one can still show existence of a solution, consistency of the approximation, derive first-order conditions and (after passing to the limit in 𝜖) obtain a (weak) form of C-stationarity. We briefly sketch the ideas here and refer the reader to [43, Section 2] for the detailed technical analysis in the context of Example 2.

Detail 7 (Sketch of Existence and Consistency Arguments)

In addition to the assumptions in Theorem 2, let F be weakly lower-semicontinuous, Zad bounded, and A symmetric.

To show that (25) has a solution, we prove the boundedness of an infimizing sequence {(zk, uk, ξk)} (despite the unboundedness of K+ and the lack of coercivity of J𝜖(z, u, ξ)). Let (z0, u0, ξ0) be feasible for (6). Then, for all sufficiently large k ∈ ℕ,

$$\displaystyle \begin{aligned} J(z_0,u_0) = J_{\epsilon}(z_0,u_0,\xi_0) \ge\\ J_{\epsilon}(z_k,u_k,\xi_k) = F(u_k) + G(z_k) + \frac{1}{\epsilon}\langle u_k, \xi_k\rangle \ge G(z_k) + \frac{1}{\epsilon}\langle u_k, \xi_k\rangle. \end{aligned} $$
(26)

Hence, \(J(z_0,u_0) - G(z_k) \ge \frac {1}{\epsilon }\langle u_k, \xi _k\rangle \ge 0\). Since Zad is bounded, there exists a weakly convergent subsequence \(\{z_{k_l}\}\) with \(z_{k_l} \rightharpoonup z^{\star } \in Z_{\mathrm {ad}}\). Then, by the weak lower-semicontinuity of G, we have

$$\displaystyle \begin{aligned} 0 \le \liminf_{l} \langle u_{k_l}, \xi_{k_l}\rangle \le \limsup_{l} \langle u_{k_l}, \xi_{k_l}\rangle \le \epsilon\left(J(z_0,u_0) - G(z^{\star})\right). \end{aligned}$$

Thus, testing (25c) with \(u_{k_l}\), we first obtain the boundedness of \(\{u_{k_l}\}\) in V and then of \(\{\xi _{k_l}\}\) in V∗. Since A is symmetric, we use an argument as in (19) to prove that (along a further subsequence \(\{k_{l_m}\}\)) \(\liminf _{m} \langle u_{k_{l_m}}, \xi _{k_{l_m}}\rangle \ge \langle u^{\star },\xi ^{\star }\rangle \), where u⋆ and ξ⋆ are the weak limit points. Since K and K+ are weakly closed, u⋆ ∈ K and ξ⋆ ∈ K+. This suffices to prove that (25) has a solution. A proof of consistency of the approximation, i.e., that global optimizers (z𝜖, u𝜖, ξ𝜖) converge as 𝜖 ↓ 0 (at least along a subsequence) to a global optimizer (z⋆, u⋆, ξ⋆) of (6), can be derived analogously using the boundedness of {z𝜖}⊂ Zad and, subsequently, the inequality 𝜖(J(z0, u0) − G(z𝜖)) ≥〈u𝜖, ξ𝜖〉≥ 0. □

As in Section 4.1, we restrict ourselves to the setting of Example 2 for the derivation of stationarity conditions. Here, the derivation of first-order optimality conditions requires the verification of a constraint qualification, in this case that of Robinson-Zowe-Kurcyusz, see [88]. According to Propositions 2.5 and 2.6 in [43], we have the following 𝜖-dependent first-order optimality conditions: For any solution (z𝜖, u𝜖, ξ𝜖) to (25) (under the data assumptions of Example 2), there exists a multiplier-tuple (p𝜖, 𝜗𝜖, τ𝜖) such that

$$\displaystyle \begin{aligned} \alpha (z_{\epsilon}, w - z_{\epsilon}) + (B^*p_{\epsilon}, w - z_{\epsilon}) &\ge 0, \quad \forall w \in Z_{\mathrm{ad}}{} \end{aligned} $$
(27a)
$$\displaystyle \begin{aligned} A^*p_{\epsilon} + \frac{1}{\epsilon}\xi_{\epsilon} - \vartheta_{\epsilon} &= u_d - u_{\epsilon},{} \end{aligned} $$
(27b)
$$\displaystyle \begin{aligned} Au_{\epsilon} - \xi_{\epsilon} &= Bz_{\epsilon} + f,{} \end{aligned} $$
(27c)
$$\displaystyle \begin{aligned} u_{\epsilon} \in K_{0}, \; \xi_{\epsilon} &\in K^+_{0}, {} \end{aligned} $$
(27d)
$$\displaystyle \begin{aligned} \frac{1}{\epsilon}u_{\epsilon} - p_{\epsilon} - \tau_{\epsilon} &= 0, {} \end{aligned} $$
(27e)
$$\displaystyle \begin{aligned} \vartheta_{\epsilon} \in K^+_{0}, \; \langle \vartheta_{\epsilon}, u_{\epsilon} \rangle = 0, \; \tau_{\epsilon} \in K_{0}, \; \langle \xi_{\epsilon}, \tau_{\epsilon} \rangle &= 0. {} \end{aligned} $$
(27f)

In comparison, this system is much larger than (16) and contains additional information. Nevertheless, under certain boundedness assumptions, passing to the limit in 𝜖 yields, in fact, a weaker C-stationarity system, see [43, Thm. 2.9]:

$$\displaystyle \begin{aligned} \alpha (z^{\star}, w - z^{\star}) + (B^*p^{\star}, w - z^{\star}) &\ge 0, \quad \forall w \in Z_{\mathrm{ad}}{} \end{aligned} $$
(28a)
$$\displaystyle \begin{aligned} A^*p^{\star} - \lambda^{\star} &= u_d - u^{\star}, {} \end{aligned} $$
(28b)
$$\displaystyle \begin{aligned} Au^{\star} - \xi^{\star} &= Bz^{\star} + f,{} \end{aligned} $$
(28c)
$$\displaystyle \begin{aligned} u^{\star} \in K_{0}, \; \xi^{\star} \in K^+_{0}, \; \langle u^{\star},\xi^{\star}\rangle &=0, {} \end{aligned} $$
(28d)
$$\displaystyle \begin{aligned} \langle \lambda^{\star}, u^{\star}\rangle = 0, \; \langle p^{\star} , \xi^{\star}\rangle = 0, \; \langle \lambda^{\star},p^{\star} \rangle &\le 0. {} \end{aligned} $$
(28e)

To see that (28e) relates to (9h), (9i), suppose for the sake of argument that ξ⋆, p⋆, λ⋆ are merely vectors of length n and that the conditions in (28e) are understood componentwise (as Hadamard products). Then, by complementarity, ξ⋆ ≥ 0 and \(\xi ^{\star }_i = 0\) whenever i is an inactive or biactive index, so 〈p⋆, ξ⋆〉 = 0 in turn implies that the strongly active components of p⋆ are zero (as in (9h)). The same applies to the inactive components of λ⋆ due to 〈λ⋆, u⋆〉 = 0 (as in (9i)).
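The following toy computation makes the componentwise reading above concrete; the vectors and index labels are purely illustrative and not taken from any of the cited examples.

```python
import numpy as np

# Toy finite-dimensional check of the argument above: reading the products in
# (28e) componentwise, p must vanish on the strongly active indices and
# lambda on the inactive ones.  The vectors are illustrative choices.

u   = np.array([2.0, 0.0, 0.0])   # index 0 inactive, 1 biactive, 2 strongly active
xi  = np.array([0.0, 0.0, 3.0])   # xi >= 0 and xi_i = 0 on inactive/biactive indices
p   = np.array([1.5, -0.5, 0.0])
lam = np.array([0.0, 1.0, -2.0])

# Componentwise complementarity and sign conditions from (28d)-(28e):
assert np.all(u * xi == 0) and np.all(p * xi == 0) and np.all(lam * u == 0)
assert np.all(lam * p <= 0)

strongly_active = (u == 0) & (xi > 0)
inactive = u > 0
print(np.all(p[strongly_active] == 0), np.all(lam[inactive] == 0))  # True True
```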

Though there is a significant gap, the derivation of (28) is related to the convergence of a function-space-based numerical method. Indeed, using known results for linear elliptic PDE-constrained optimization problems with control and state constraints, we have viable and efficient algorithms that can guarantee convergence to a KKT point satisfying (27). By performing continuation on 𝜖, we are then assured of convergence (along a subsequence) to a weak C-stationary point. Furthermore, Theorem 2.12 in [43] provides a more compelling argument.

Theorem 8 (Thm 2.12 [43])

Suppose (z𝜖, u𝜖, ξ𝜖, p𝜖, 𝜗𝜖, τ𝜖) satisfies (27) and that (z𝜖, u𝜖, ξ𝜖) is feasible for (7). Then, (z𝜖, u𝜖, ξ𝜖) is strongly stationary in the sense of Mignot and Puel, i.e., conditions (9h)–(10b) are replaced by

$$\displaystyle \begin{aligned} p &= 0,\mathit{\text{ q.e. }} x \in \varOmega^{0+} {} \end{aligned} $$
(29a)
$$\displaystyle \begin{aligned} \langle \lambda, \phi \rangle &= 0, \forall \phi \in H^1_0(\varOmega): \phi = 0,\mathit{\text{ q.e. }} \varOmega^{0} {} \end{aligned} $$
(29b)
$$\displaystyle \begin{aligned} p &\le 0,\mathit{\text{ q.e. }} x \in \varOmega^{00} {} \end{aligned} $$
(29c)
$$\displaystyle \begin{aligned} \langle \lambda, \phi \rangle &\ge 0, \forall \phi \in H^1_0(\varOmega): \phi \ge 0,\mathit{\text{ q.e. }} \varOmega^{00}, \phi = 0,\mathit{\text{ q.e. }} \varOmega \setminus (I \cup \varOmega^{00}) {} \end{aligned} $$
(29d)

Since we are using an ℓ1 penalty, which often amounts to an exact penalty function, there is a good chance that a stationary point (z𝜖, u𝜖, ξ𝜖) is feasible for (7) for sufficiently small 𝜖.

Algorithm 4.3 ℓ1 penalty method: outer loop

Again we might choose as a stopping criterion the residual of C-stationarity, taking λ := 𝜖−1ξ − 𝜗 and substituting τ = 𝜖−1u − p. Since the theory only guarantees 𝜗 ∈ H−1(Ω), one will need to treat the discrete quantities carefully, cf. [43] for more details. In addition, the solution of the subproblems (25) does not reduce to the solution of a system similar to (20). Therefore, the solution of the subproblems can be more difficult than in the adaptive penalty framework. Nevertheless, the theoretical result in Theorem 8 indicates the potential of this algorithm to generate a better stationary point.

5 Non-Smooth Numerical Methods

In this section, we present several methods that do not require a smoothing or penalization of the original MPEC. We present a new approximate projected subgradient method alongside a direct solver for the C-stationarity system presented in [36] and a recent method from [46] that may serve as a globalization of the direct solver. In all of these methods, we need to solve the variational inequality (1). This can be done using the semismooth Newton methods as in [38, 41] or special monotone multigrid methods as in [55].

5.1 An Approximate Projected Subgradient Method

The subgradient method is perhaps the simplest non-smooth optimization algorithm. Suppose X is a real separable Hilbert space. Given a proper, convex, lower-semicontinuous, and subdifferentiable functional \(f: X \to \mathbb {R}\), x0 ∈ X, g0 ∈ ∂f(x0), and a sequence {νk} with νk > 0, compute a sequence of iterates {xk} according to the rule

$$\displaystyle \begin{aligned} x^{k+1} := x^k - \nu_k g^k, \quad g^{k+1} \in \partial f(x^{k+1}).\end{aligned} $$

Here, and unless otherwise noted below, we would need to apply the Riesz map to gk before using it in this iteration.

Similarly, given a non-empty, closed, and convex subset C ⊂ X the projected-gradient method replaces the previous rule by

$$\displaystyle \begin{aligned} x^{k+1} := \mathrm{Proj}_{C}\, (x^k - \nu_k g^k), \quad g^{k+1} \in \partial f(x^{k+1}).\end{aligned} $$
(30)

The subgradient method was invented by N.Z. Shor in the 1970s, see [76], and although it does not guarantee descent of the objective functional and can be quite slow to converge, it still finds a wide array of applications due to its simplicity and ability to be combined with distributed algorithm techniques, as in, e.g., [78].
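For readers unfamiliar with the method, the following minimal numpy sketch implements iteration (30) for a generic convex model problem (here, minimizing ||Mx − b||_1 over a box); the data, step-size rule, and iteration budget are illustrative choices of ours and not tied to the MPEC setting below.

```python
import numpy as np

# Minimal sketch of the projected subgradient iteration (30) for a toy convex
# problem: minimize f(x) = ||M x - b||_1 over the box C = [0, 1]^n.
# The data M, b, the step sizes, and the iteration budget are illustrative.

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def subgradient(x):
    # A subgradient of ||M x - b||_1 is M^T sign(M x - b).
    return M.T @ np.sign(M @ x - b)

def proj_C(x):
    # Projection onto the box [0, 1]^n is componentwise clipping.
    return np.clip(x, 0.0, 1.0)

x = proj_C(rng.standard_normal(n))
f_best = np.inf
for k in range(1, 2001):
    g = subgradient(x)                 # g^k in the subdifferential at x^k
    nu = 1.0 / np.sqrt(k)              # diminishing step size nu_k
    x = proj_C(x - nu * g)             # rule (30)
    f_best = min(f_best, np.linalg.norm(M @ x - b, 1))

print("best objective value found:", f_best)
```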

Consider the reduced elliptic MPEC in (5):

$$\displaystyle \begin{aligned} \min\ &\mathcal{J}(z) := F(S(Bz)) + G(z) \text{ over } z\in Z_{\mathrm{ad}}.\end{aligned} $$

Since S is nonlinear, the reduced objective functional is typically non-convex. Therefore, we cannot directly apply the projected subgradient method. However, there exist a number of generalized subdifferentials for non-convex functions. In our case, we will initially make use of the limiting subdifferential (also known as the Mordukhovich subdifferential) for the reduced objective. We will restrict ourselves to the framework of Example 2. First, we recall several definitions from variational analysis, see [66].

Definition 6 (Normal Cones to Arbitrary Sets)

Let X be a Hilbert space and C ⊂ X. Then, the multifunction \(\widehat {\mathcal {N}}_{C}:X \rightrightarrows X^*\) defined by

$$\displaystyle \begin{aligned} \widehat{\mathcal{N}}_{C}(x):= \left\{ x^* \in X^*\left|\; \langle x^*,x' {-} x\rangle_{X} \le o(||x'{-}x||{}_{X}),\; \forall x'\stackrel{X}{\to} x, x' \in C \right.\right\},\quad x\in C,\end{aligned} $$
(31)

and \(\widehat {\mathcal {N}}_C(x):=\emptyset \) for x ∉ C is called the regular (Fréchet) normal cone to C. The multifunction \(\mathcal {N}_{C}:X\rightrightarrows X^*\) defined by

$$\displaystyle \begin{aligned} \mathcal{N}_{C}(x):= \left\{x^*\in X^*\left|\;\exists x_k \stackrel{X}{\to}x,\,\exists x^*_k\stackrel{X^*}{\rightharpoonup} x^*:\,x^*_k\in\widehat{\mathcal{N}}_{C}(x_k),\,\,\forall k\in\mathbb{N}\right.\right\} \end{aligned} $$
(32)

is called the limiting (Mordukhovich) normal cone to C.

Although \(\widehat {\mathcal {N}}_{C}\) is convex, it fails to admit a satisfactory calculus needed for most non-smooth, non-convex problems. In contrast, the limiting normal cone enjoys a robust calculus. Note that for closed convex sets C, both cones agree, and in general \(\widehat {\mathcal {N}}_{C}(x) \subsetneqq \mathcal {N}_{C}(x)\). We will use the limiting normal cone to define a generalized subgradient, needed in part for our proposed numerical method.

Definition 7 (Limiting Subdifferential)

Let X be a Hilbert space, \(\phi :X \to \overline {\mathbb {R}}\), and x ∈ X such that |ϕ(x)| < +∞. The set

$$\displaystyle \begin{aligned} \partial \phi(x) := \left\{ x^* \in X^* \left|\; (x^*,-1)\in \mathcal{N}_{\mathrm{epi}\ \phi}(x,\phi(x)) \right.\right\} \end{aligned} $$
(33)

is called the limiting (Mordukhovich) subdifferential. If |ϕ(x)| = +∞, we set ∂ϕ(x) = ∅.

Therefore, if we know the limiting subdifferential \(\partial \mathcal {J}(z)\) in (5), then we could design a projected subgradient iteration along the lines of (30):

$$\displaystyle \begin{aligned} z^{k+1} := \mathrm{Proj}_{Z_{\mathrm{ad}}}\, (z^k - \nu_k g^k), \quad g^{k+1} \in \partial \mathcal{J}(z^{k+1}). \end{aligned}$$

If \(\mathcal {J}\) were smooth, then \(\partial \mathcal {J}\) would reduce to the gradient of the reduced objective functional, which we usually calculate in PDE-constrained optimization by solving an adjoint equation. Here, however, \(\mathcal {J}\) is non-smooth and non-convex. In order to obtain a generalized adjoint state for the reduced objective functional we require the so-called “coderivatives.”

Definition 8 (Coderivatives)

Let X be a Hilbert space, \(\varPhi :X \rightrightarrows X^*\), and y ∈ Φ(x), i.e., \((x,y)\in \mathop {\mathrm {Graph}}\varPhi \). The regular (Fréchet) coderivative of Φ at (x, y) is the multifunction \(\widehat {D}^*\varPhi (x,y)\colon X^{**}\rightrightarrows X^*\) defined by

$$\displaystyle \begin{aligned} h^*\in\widehat{D}^*\varPhi (x,y)(d^*)\Longleftrightarrow(h^*,-d^*)\in\widehat{\mathcal{N}}_{\mathop{\mathrm{Graph}} \varPhi}(x,y). \end{aligned} $$
(34)

The limiting (Mordukhovich) coderivative DΦ(x, y) of Φ at \((x,y)\in \mathop {\mathrm {Graph}}\varPhi \) is similarly defined by

$$\displaystyle \begin{aligned} h^*\in D^*\varPhi(x,y)(d^*)\Longleftrightarrow(h^*,-d^*)\in \mathcal{N}_{\mathop{\mathrm{Graph}}\varPhi}(x,y). \end{aligned} $$
(35)

For example, if Φ = S𝜖 from Section 4.1, then the coderivatives coincide and we have:

$$\displaystyle \begin{aligned} \widehat D^*S_{\epsilon}(w,u)(w^*)=D^*S_{\epsilon}(w,u)(w^*) = S_{\epsilon}^{\prime}(w)^*w^*, \quad w^*\in X^*, \end{aligned}$$

i.e., \(\widehat D^*S_{\epsilon }(w,u)(w^*)\) yields the usual adjoint state p obtained by solving the associated linear elliptic PDE with − w on the right-hand side.

For the tracking-type objective in Example 2, it was argued in [44, Prop. 1] that

$$\displaystyle \begin{aligned} \partial \mathcal{J}(z) \subset \alpha z+ B^*D^*S(B z,u)(u- u_d). \end{aligned} $$
(36)

Here, it follows from [66, Thm 4.44] that p ∈ DS(Bz, u)(u − ud) is a solution to the generalized adjoint equation:

$$\displaystyle \begin{aligned} A^* p + D^* \mathcal{N}_{K_0}(u,\xi)(p) \ni u_d - u, \end{aligned} $$
(37)

where \(\xi = Bz + f - Au \in \mathcal {N}_{K_0}(u)\). Therefore, assuming that (37) were solvable, we could fashion our projected subgradient method as in Algorithm 5.4.

Algorithm 5.4 Limiting projected subgradient algorithm

Since we have no efficient means of handling \(D^* \mathcal {N}_{K_0}(u^{k+1},\xi ^{k+1})(p)\), Algorithm 5.4 is impractical, especially when we consider that subgradient methods potentially require many iterations even for favorable convex problems. We therefore propose an alternative in Algorithm 5.5. The formal derivation for the approximate generalized adjoint state is based on simple geometric observations for a related finite-dimensional setting.

Consider that the (convex) normal cone \(\mathcal {N}_{K_0}\) is generated by the (convex) subdifferential of the indicator functional \(i_{K_0}\). Since the functionals ϕ𝜖 in (13) converge in a variational sense to \(i_{K_0}\), they provide a viable candidate for approximating elements of \(\mathcal {N}_{K_0}\). Moreover, for any u ∈ V, ∇ϕ𝜖(u) = −𝜖−1(−u)+. Letting 𝜖 ↓ 0, it appears (at least in finite dimensions) that \( \mathop {\mathrm {Graph}} \nabla \phi ^{\epsilon } \to \mathop {\mathrm {Graph}} -\mathcal {N}_{K_0}\), see Figure 2. This behavior transfers to \(\widehat {\mathcal {N}}_{ \mathop {\mathrm {Graph}} -\mathcal {N}_{K_0}}(u,\xi )\) and \(\widehat {\mathcal {N}}_{ \mathop {\mathrm {Graph}} \nabla \phi ^{\epsilon }}(u,\xi )\), cf. Figure 2 with \(\varTheta := \mathop {\mathrm {Graph}} -\mathcal {N}_{\mathbb {R}_+}\) and \(\varLambda := \mathop {\mathrm {Graph}} \nabla \phi ^{0.1}\).

Fig. 2

From left to right: \( \mathop {\mathrm {Graph}} \nabla \phi ^{\epsilon }\) (𝜖 = 1, 0.5, 0.1) versus \( \mathop {\mathrm {Graph}} \mathcal {N}_{ \mathbb {R}_+}\); \(\widehat {\mathcal {N}}_{ \mathop {\mathrm {Graph}} -\mathcal {N}_{ \mathbb {R}_+}}(u,\xi )\) for (u, ξ) = (1, 0), (0, 1), (0, 0); \(\widehat {\mathcal {N}}_{ \mathop {\mathrm {Graph}}\nabla \phi ^{0.1}}(u,\xi )\) for (u, ξ) = (1, 0), (0, 1), (0, 0)

Using the information in Figure 2, we can first calculate the limiting normal cones \(\mathcal {N}_{\varTheta }(u,\xi )\), from which we obtain: \(D^*\mathcal {N}_{\varTheta }(1,0)(p) = \{0\}, \text{ for all } p,\) and

$$\displaystyle \begin{aligned} D^*\mathcal{N}_{\varTheta}(0,1)(p) = \mathbb{R}, \text{ if } p = 0,\end{aligned} $$

otherwise \(D^*\mathcal {N}_{\varTheta }(0,1)(p) = \emptyset \). The most interesting case is:

$$\displaystyle \begin{aligned} D^*\mathcal{N}_{\varTheta}(0,0)(p) = \left\{\begin{array}{ll} \{0\},& \text{ for all } p < 0,\\ \mathbb{R},& \text{ if } p = 0\\ \mathbb{R}_{-},& \text{else.} \end{array}\right.\end{aligned} $$

Similarly, we have for all p

$$\displaystyle \begin{aligned} D^*\mathcal{N}_{\varLambda}(1,0)(p) = \{0\},\quad D^*\mathcal{N}_{\varLambda}(0,1)(p) = -\epsilon^{-1} p;\end{aligned} $$

and

$$\displaystyle \begin{aligned} D^*\mathcal{N}_{\varLambda}(0,0)(p) = \left\{ \begin{array}{ll} \{0\}\cup \{ -\epsilon^{-1} p\},& \text{ for all } p < 0,\\ \{0\}, & \text{ if } p = 0,\\ \left[-\epsilon^{-1}p,0\right] & \text{else.} \end{array} \right. \end{aligned}$$

Though certainly more tractable numerically, \(D^*\mathcal {N}_{\varLambda }\) is still a set-valued mapping with non-convex images. We therefore suggest the following single-valued mapping:

$$\displaystyle \begin{aligned} \widetilde{D}^*\mathcal{N}_{\varLambda}(0,0)(p) := \left\{q : q = -\epsilon^{-1}\chi_{\{u = 0\}}p \right\} \end{aligned}$$

where χ{u=0} is the characteristic function of the active set. This mapping coincides with the limiting coderivative \(D^*\mathcal {N}_{\varTheta }\) on the inactive and strongly active sets, whenever \(D^*\mathcal {N}_{\varTheta }\) is non-empty. On the biactive set, it is either contained in \(D^*\mathcal {N}_{\varTheta }\) or approaches it for 𝜖 ↓ 0, cf. Figure 3.

Fig. 3

Blue: \(D^*\mathcal {N}_{\varTheta }(0,0)(p)\), Red: \(D^*\mathcal {N}_{\varLambda }(0,0)(p)\), Black: \(\widetilde {D}^*\mathcal {N}_{\varLambda }(0,0)(p)\); 𝜖 = 0.5
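The comparison underlying Figures 2 and 3 can also be tabulated directly. The small sketch below evaluates the set-valued coderivative \(D^*\mathcal {N}_{\varTheta }\) at the three reference points together with the single-valued surrogate −𝜖−1χ{u=0}p; the function names and the printed set descriptions are our own.

```python
import numpy as np

# Tabulate the comparison behind Figures 2 and 3: the set-valued coderivative
# D*N_Theta(u,xi)(p) at the three reference points versus the single-valued
# surrogate q = -eps^{-1} * chi_{u=0} * p.  Names and set descriptions are ours.

def coderivative_Theta(u, xi, p):
    """D*N_Theta(u,xi)(p) as a human-readable set description."""
    if u > 0 and xi == 0:                 # inactive point (1,0)
        return "{0}"
    if u == 0 and xi > 0:                 # strongly active point (0,1)
        return "R" if p == 0 else "empty set"
    if u == 0 and xi == 0:                # biactive point (0,0)
        if p < 0:
            return "{0}"
        return "R" if p == 0 else "R_- (nonpositive reals)"
    return "empty set"

def surrogate(u, xi, p, eps):
    """Single-valued surrogate -eps^{-1} chi_{u=0} p used in Algorithm 5.5."""
    return -p / eps if u == 0 else 0.0

eps = 0.5
for (u, xi) in [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]:
    for p in (-1.0, 0.0, 1.0):
        print((u, xi), "p =", p,
              "| D*N_Theta:", coderivative_Theta(u, xi, p),
              "| surrogate:", surrogate(u, xi, p, eps))
```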

By extrapolating these ideas from this simple one-dimensional geometric study to the infinite-dimensional setting, we arrive at our proposed algorithm.

Algorithm 5.5 Approximate projected subgradient algorithm

Remark 4

In fact, − χ{u≤0}p is nothing more than the adjoint of the Newton derivative for the non-smooth Nemytskii operator \((- \cdot )_+ : H^1_0(\varOmega ) \to L^2(\varOmega )\).
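To make the adjoint step of Algorithm 5.5 concrete, the following one-dimensional finite-difference sketch solves (A* + 𝜖−1χ{u≤0})p = ud − u and recovers the multiplier λ from Theorem 9. The grid, the toy data u and ud, and the value of 𝜖 are our own illustrative choices; they are not the data of Example 3.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 1D finite-difference sketch of the adjoint step behind Algorithm 5.5 and
# Remark 4: solve (A^* + eps^{-1} chi_{u<=0}) p = u_d - u and recover
# lambda = -eps^{-1} chi_{u<=0} p as in Theorem 9.  Grid, data, and eps are
# illustrative choices (not the data of Example 3).

n = 199                              # interior grid points on (0, 1)
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)

# A = negative Laplacian with homogeneous Dirichlet conditions (symmetric).
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc") / h**2

u = np.maximum(0.0, np.sin(2 * np.pi * x))   # toy state; active set where u = 0
u_d = np.sin(np.pi * x)                      # toy desired state
eps = 1e-6

chi = (u <= 0.0).astype(float)               # characteristic fct. of {u <= 0}
M = (A.T + sp.diags(chi / eps)).tocsc()      # A^* + eps^{-1} chi_{u<=0}
p = spla.spsolve(M, u_d - u)                 # approximate generalized adjoint
lam = -chi * p / eps                         # multiplier lambda from Theorem 9

print("||chi * p||_inf =", np.abs(chi * p).max())   # small: p ~ 0 on active set
```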

Using the analytical techniques described throughout the text, we have the next result.

Theorem 9

Suppose that Zad is bounded. Then, any sequence {(zk, uk, ξk, pk, λk)} ⊂ Z × V × V* × V × V* generated by Algorithm 5.5 is bounded. Here,

$$\displaystyle \begin{aligned} \lambda_k := -\frac{1}{\epsilon_k} \chi_{\{ u^{k} \le 0 \}}p^k. \end{aligned}$$

Moreover, any weak accumulation point (z⋆, u⋆, ξ⋆, p⋆, λ⋆) of this sequence will be feasible for (7), and we have

$$\displaystyle \begin{aligned} \lim_{k\to \infty}\int_{\{u^k = 0\}}|p^k|{}^2 \mathrm{d}x = 0,\quad \langle \lambda^\star,p^\star\rangle \le 0, \quad \langle \lambda^\star,u^\star \rangle = 0. \end{aligned} $$
(38)

Proof

Since Zad is bounded, {zk} is bounded in Z. It immediately follows that {uk} is bounded, since

$$\displaystyle \begin{aligned} c\|u^k\|{}^2_V \le \langle Au^k,u^k \rangle - \langle \xi^k, u^k\rangle = \langle Bz^k + f, u^k\rangle \Rightarrow c\|u^k\|{}_V \le \| Bz^k + f \|{}_{V^*} \end{aligned}$$

and ξk = Bzk + f − Auk. Similarly, we obtain the boundedness of {pk} in V and {λk} in V*:

$$\displaystyle \begin{aligned} c\|p^k\|{}^2_V \le \langle A^*p^k,p^k \rangle + \langle \lambda^k, p^k\rangle = \langle u_d - u^k,p^k\rangle \Rightarrow c\|p^k\|{}_V \le \| u_d - u^k\|{}_{V^*} \end{aligned}$$

and λk = ud − uk − Apk. Furthermore, since

$$\displaystyle \begin{aligned} 0 \le \epsilon_k\langle A^*p^k,p^k \rangle + \int_{\{u^k = 0\}}|p^k|{}^2 \mathrm{d}x = \epsilon_k\langle u_d - u^k,p^k\rangle \end{aligned}$$

the limit condition in (38) holds. Moreover, we can again show that 〈λ⋆, p⋆〉≤ 0 using the same argument as in (19), and by definition 〈λk, uk〉 = 0. Since B is compact, uk → u⋆ in V (along a subsequence). This yields 〈λ⋆, u⋆〉 = 0.

The purpose of Theorem 9 is to show that Algorithm 5.5 produces a sequence with a weak accumulation point that satisfies a kind of limiting C-stationarity system. However, the lag in indices prevents us from closing the argument by proving that {zk} fulfils (9) (regardless of whether we choose fixed, bounded, or diminishing step sizes). Nevertheless, we will see later in the bundle-free approach that a variant of our approximate adjoint equation can under certain circumstances yield descent directions for the reduced objective.

We now demonstrate the performance of the algorithm on an example with a nontrivial biactive set: Example 3. In order to assure feasibility at every step, we solve the variational inequality with the primal-dual active set (PDAS)/semismooth Newton method from [38]. Note that although the solver for the variational inequality is mesh dependent, the majority of the linear solves are done within the first four iterations, see Table 2. For a graph of the behavior of the residuals as well as plots of the solution, see Figure 4.

Fig. 4

Clockwise from upper left: residual of strong stationarity for Algorithm 5.5, optimal control z, state u, multiplier λ, adjoint p, and multiplier ξ

Table 2 Outer loop k vs. inner loop iterations “iter” for PDAS with tol = 10−10 used in Algorithm 5.5

We again use a uniform grid with 512 × 512 grid points and start the algorithm at (z0, u0, ξ0) = (0, 0, 0). We choose the a priori step sizes νk := k−1∕2 and update 𝜖k according to 𝜖k := 10−4∕2k. The inner PDAS solver stops once the residual of the non-smooth system of equations reaches a tolerance of 10−10. Though the theory does not provide a stopping criterion, we check the residual of strong stationarity, which reaches \(\mathcal {O}(10^{-9})\) after 30 iterations. The residual is calculated using a discrete approximation of the following quantity:

$$\displaystyle \begin{aligned} \mathrm{res^k} := \| z^{k} - \mathrm{Proj}_{Z_{\mathrm{ad}}}\, (z^k - g^k)\|{}_{L^2} + \|Au^{k} - \xi^{k} - z^{k} - f\|{}_{L^2} + \|\max(0,-u^{k})\|{}_{L^2}+ \\ \|\max(0,-\xi^k)\|{}_{L^2} + |(\xi^k,u^k)_{L^2}| + \|Ap^{k} - \lambda^{k} + u^{k} -u_{d}\|{}_{L^2} + |\max(0, \langle \lambda^k,p^k\rangle)|+ \\ \|\chi_{\varOmega^{0+}_k}p^k\|{}_{L^2} + \|\chi_{\mathcal{I}_k}\lambda^k\|{}_{L^2} + \|\chi_{\varOmega^{00}_k}\max(0,p^k)\|{}_{L^2} + \|\chi_{\varOmega^{00}_k}\max(0,-\lambda^k)\|{}_{L^2}\end{aligned} $$
(39)
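A discrete realization of (39) might look as follows. We assume a uniform 2D grid with mesh size h (so that the L2 norm of a nodal vector is approximated by h times its Euclidean norm), a sparse matrix A, a pointwise projection proj_Zad, and tolerance-based identification of the active, biactive, and inactive sets; g denotes whatever approximate subgradient Algorithm 5.5 uses in its projection step. All of these discretization choices are ours.

```python
import numpy as np

# Sketch of the discrete residual (39).  Assumptions (ours): uniform 2D grid
# with mesh size h, so the L^2 norm of a nodal vector v is approximated by
# h * ||v||_2; A is a sparse matrix; proj_Zad realizes Proj_{Z_ad} pointwise;
# g is the approximate subgradient used in the projection step of
# Algorithm 5.5; the active/biactive sets are detected with a small tolerance.

def residual_39(z, u, xi, p, lam, g, A, f, u_d, proj_Zad, h, tol=1e-8):
    def l2(v):
        return h * np.linalg.norm(v)

    def ip(a, b):
        return h**2 * np.dot(a, b)     # discrete L^2 inner product

    act = u <= tol                     # active set {u = 0} (up to tol)
    strong = act & (xi > tol)          # strongly active set Omega^{0+}
    biact = act & (np.abs(xi) <= tol)  # biactive set Omega^{00}
    inact = ~act                       # inactive set I

    return (l2(z - proj_Zad(z - g))
            + l2(A @ u - xi - z - f)
            + l2(np.maximum(0.0, -u))
            + l2(np.maximum(0.0, -xi))
            + abs(ip(xi, u))
            + l2(A @ p - lam + u - u_d)
            + max(0.0, ip(lam, p))
            + l2(strong * p)
            + l2(inact * lam)
            + l2(biact * np.maximum(0.0, p))
            + l2(biact * np.maximum(0.0, -lam)))
```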

Despite lacking a convergence theory, the algorithm performs very well on this large-scale nontrivial problem. Indeed, counting all the free variables (z, u, ξ, etc.), there are over 10^6 degrees of freedom. Moreover, the presence of biactivity means that the example considered is genuinely non-smooth and non-convex. This is particularly encouraging, as the algorithm clearly outperforms the adaptive penalty method in terms of ease of implementation, size and structure of the systems of linear equations, and order of accuracy (almost reaching even strong stationarity). In particular, we note the “sharpness” of the solution in stark contrast to that obtained with the smooth method.

5.2 A Direct Solver for C-Stationarity Conditions

In this section, we adapt a method from [36] to the canonical MPEC (7). We work in the setting of Example 2 and assume that

$$\displaystyle \begin{aligned} Z_{\mathrm{ad}} := \left\{v \in L^2(\varOmega) \left|\; a \le v \le b, \text{ a.e. } x \in \varOmega \right.\right\} \end{aligned}$$

with a, b ∈ L2(Ω), a < b. We first state the algorithm. Then, we motivate the steps of the algorithm and discuss its convergence properties.

The formal derivation of Algorithm 5.6, which we describe in detail below, follows several basic steps: (1) fix a control and estimate the active and inactive sets; (2) ignoring biactivity, use (28e) to approximate (28b)–(28e) as a system of equations; (3) solve the reduced system (including (28a)) to obtain an update; (4) test the residual of C-stationarity and, if necessary, return to (1).

Algorithm 5.6 An active-set equality-constrained Newton solver with feasibility restoration

More specifically, suppose we are in the infinite-dimensional setting. Assuming that ξ⋆ is sufficiently regular, Equations (28c), (28d) can be understood as the following system of smooth and non-smooth equations:

$$\displaystyle \begin{aligned} Au^{\star} - \xi^{\star} &= Bz^{\star} + f, {} \end{aligned} $$
(40a)
$$\displaystyle \begin{aligned} \xi^{\star} &= (\xi^{\star} - cu^{\star})_+, {} \end{aligned} $$
(40b)

where c > 0 is some scaling constant. Since ξ⋆ ∈ L2(Ω), the Newton derivative for the plus function described at the end of Section 4.1 is not valid in this setting (since both the domain and range here must be taken as L2(Ω)). On a discrete level, this is not an issue. For some fixed z, solving (40) gives a pair (u⋆, ξ⋆) along with active and inactive sets:

$$\displaystyle \begin{aligned} \varOmega^0(u^{\star}) := \left\{x \in \varOmega \left|\; u^{\star}(x) = 0 \right.\right\}, \quad \varOmega^+(u^{\star}) := \left\{x \in \varOmega \left|\; u^{\star}(x) > 0 \right.\right\}.\end{aligned} $$
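On the discrete level, (40) can be solved with the primal-dual active set (semismooth Newton) strategy of [38]. The sketch below does this for a one-dimensional finite-difference discretization with a toy right-hand side standing in for Bz + f; the data and the constant c are illustrative choices of ours.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Primal-dual active set (semismooth Newton) sketch for (40) with a fixed
# control:  A u - xi = b,  xi = max(0, xi - c u).  1D finite differences on
# (0,1); the right-hand side b (standing in for B z + f) and the constant c
# are illustrative choices; see [38] for the method itself.

n = 199
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc") / h**2

b = 50.0 * np.sin(3 * np.pi * x)      # stands in for B z + f
c = 1.0

u, xi = np.zeros(n), np.zeros(n)
active_old = None
for k in range(50):
    active = (xi - c * u) > 0.0       # predicted contact set {u = 0}
    if active_old is not None and np.array_equal(active, active_old):
        break                         # active set settled: PDAS has converged
    idx = np.flatnonzero(~active)     # inactive indices
    u = np.zeros(n)
    A_II = A[idx][:, idx]             # solve A_II u_I = b_I, u = 0 elsewhere
    u[idx] = spla.spsolve(A_II.tocsc(), b[idx])
    xi = A @ u - b                    # multiplier update; xi = 0 on inactive set
    active_old = active

print("PDAS iterations:", k, " min(u) =", u.min(), " min(xi) =", xi.min())
```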

As in our discussion of (28), if we treat the variables u, λ as finite-dimensional vectors and the complementarity condition as the pointwise product of u and λ, then the complementarity condition would indicate that λ = 0 on the inactive set. Analogously, we take p = 0 on the (entire) active set, thus ignoring biactivity. Consequently the remaining sign condition in (28) holds. Finally, using the projection formula (23) along with a semismooth Newton step, we can handle the variational inequality (28a). We recall the sets

$$\displaystyle \begin{aligned} \varOmega^a &:= \left\{ x \in \varOmega \left| a(x) + \alpha^{-1}(B^*p^{\star})(x) > 0\right.\right\},\\ \varOmega^b &:= \left\{ x \in \varOmega \left| -\alpha^{-1}(B^*p^{\star})(x)- b(x) > 0\right.\right\}.\end{aligned} $$

and Ωina := Ω ∖ (Ωa ∪ Ωb).

Now, supposing we want a new approximation (u, z, p) via u = u⋆ + δu, z = z⋆ + δz, and p = p⋆ + δp, we consider the reduced system obtained by eliminating the dual variables:

$$\displaystyle \begin{aligned} \delta z +\alpha^{-1} \chi_{\varOmega^{ina}}B^*\delta p &= \mathrm{Proj}_{Z_{\mathrm{ad}}}(-\alpha^{-1}B^*p^{\star}) - z^{\star} &\text{ on } \varOmega, {} \end{aligned} $$
(41a)
$$\displaystyle \begin{aligned} A^* \delta p + \delta u &= u_d - u^{\star} -A^*p^{\star}, &\text{ on } \varOmega^+(u^{\star}), {} \end{aligned} $$
(41b)
$$\displaystyle \begin{aligned} \delta p &= 0, &\text{ on } \varOmega^0(u^{\star}), {} \end{aligned} $$
(41c)
$$\displaystyle \begin{aligned} A \delta u - B\delta z &= Bz^{\star} + f - Au^{\star}, &\text{ on } \varOmega^+(u^{\star}), {} \end{aligned} $$
(41d)
$$\displaystyle \begin{aligned} \delta u &= 0, &\text{ on } \varOmega^0(u^{\star}), {}\end{aligned} $$
(41e)

If we use (41a) to eliminate δz from (41d), then we get the smaller system in (δu, δp):

$$\displaystyle \begin{aligned} A^* \delta p + \delta u &= u_d - u^{\star} -A^*p^{\star}, &\text{ on } \varOmega^+(u^{\star}), {} \end{aligned} $$
(42a)
$$\displaystyle \begin{aligned} \delta p &= 0, &\text{ on } \varOmega^0(u^{\star}), {} \end{aligned} $$
(42b)
$$\displaystyle \begin{aligned} A \delta u + \alpha^{-1} B \chi_{\varOmega^{ina}}B^*\delta p &= B \mathrm{Proj}_{Z_{\mathrm{ad}}}(-\alpha^{-1}B^*p^{\star}) + f -Au^{\star}, &\text{ on } \varOmega^+(u^{\star}), {} \end{aligned} $$
(42c)
$$\displaystyle \begin{aligned} \delta u &= 0, &\text{ on } \varOmega^0(u^{\star}), {} \end{aligned} $$
(42d)

which is remarkably similar to the semismooth Newton step in the adaptive penalty method. Unlike the smooth methods, however, a proof of convergence remains elusive. The main culprit here is clearly the sequence \(\{\varOmega ^+_k\}\), which need not converge with respect to any notion of set convergence, e.g., Painlevé-Kuratowski or in the sense of characteristic functions.

Note also that this formally derived system is well-defined in function space provided that we have enough regularity. For example, provided that the sets Ω+(u⋆), Ω0(u⋆) are sufficiently regular, we could reduce the search to \((\delta u, \delta p) \in \mathcal {H} \times \mathcal {H}\) with \(\mathcal {H} := H^1_0(\varOmega ^+(u^{\star }))\), solve the associated weak forms of (42a), (42c), and extend the solutions by zero on Ω0(u⋆).
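A finite-dimensional reading of one step of Algorithm 5.6 is sketched below: estimate the control-active sets from the projection formula, restrict (42) to the discrete set Ω+(u⋆), solve the resulting block system, and recover δz from (41a). We take B as the identity embedding, and the tolerances and variable names are our own; this is not the implementation of [36].

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Finite-dimensional sketch of one step of Algorithm 5.6: estimate the
# control-active sets, restrict (42) to the discrete set Omega^+(u*), solve
# the block system, and recover delta_z from (41a).  We take B as the
# identity embedding; tolerances and names are ours (not the code of [36]).

def reduced_newton_step(A, u, p, z, f, u_d, a, b, alpha, tol=1e-8):
    n = u.size
    zbar = np.clip(-p / alpha, a, b)               # Proj_{Z_ad}(-alpha^{-1} B* p)
    chi_ina = ((zbar > a + tol) & (zbar < b - tol)).astype(float)  # Omega^ina

    idx = np.flatnonzero(u > tol)                  # Omega^+(u*): unknowns live here
    A_II = A[idx][:, idx]
    D_II = sp.diags(chi_ina[idx] / alpha)
    Id = sp.identity(idx.size, format="csr")

    # Rows: (42a)  A^T dp + du = r1   and   (42c)  A du + alpha^{-1} chi dp = r2.
    K = sp.bmat([[A_II.T, Id], [D_II, A_II]], format="csc")
    r1 = (u_d - u - A.T @ p)[idx]
    r2 = (zbar + f - A @ u)[idx]
    sol = spla.spsolve(K, np.concatenate([r1, r2]))

    dp, du = np.zeros(n), np.zeros(n)
    dp[idx], du[idx] = sol[:idx.size], sol[idx.size:]
    dz = zbar - z - chi_ina * dp / alpha           # from (41a)
    return du, dp, dz
```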

An obvious stopping criterion for Algorithm 5.6 would be the residual of (28) up to some user-defined tolerance (as suggested in [36]). Moreover, we see that the computational effort is roughly that of the adaptive penalty method. In fact, there is much less nonlinearity here due to a lack of the penalty terms ϕε. We also mention that the MPECs in [36] are much more challenging than (7), as the controls there arise inside the differential operator. Nevertheless, the algorithm seems to perform very well, even on examples with nontrivial biactive sets.

We demonstrate the performance of Algorithm 5.6 on Example 3. As expected, Algorithm 5.6 behaves like a second-order method, see Table 3. We once again used the PDAS/semismooth Newton method in [38] to restore feasibility at every step. Moreover, since the multiplier λ is eliminated from the algorithm, we artificially reintroduce it for the calculation of the residuals.

Table 3 Residuals of strong stationarity for Algorithm 5.6 with stopping tolerance tol = 10−7

The solutions look identical to those plotted in Figure 4. We therefore only provide images of the biactive sets for u and the upper and lower active sets for z. We note that this algorithm also performs quite well, reaching a residual of strong stationarity on the order of \(\mathcal {O}(10^{-8})\) within k = 5 iterations, though it never reaches \(\mathcal {O}(10^{-9})\), in contrast to the approximate projected subgradient algorithm. In addition, the effort to solve each step is higher, as seen in Table 4.

Fig. 5

Plot of solutions using Algorithm 5.6. Left to right: characteristic functions \(\chi _{\varOmega ^{00}_{\star }}\) (set of all indices with \(u^{\star }_i \le \)1e−8 and \(\xi ^{\star }_i \le 0\)), \(\chi _{\varOmega ^{b}_{\star }}\) (set of all indices with \(z^{\star }_i \ge b_i-\)1e−6), and \(\chi _{\varOmega ^{a}_{\star }}\) (set of all indices with \(z^{\star }_i \le a_i+\)1e−6)

Table 4 Outer loop k vs. inner loop “iter” iterations for PDAS with tol = 10−10 used in Algorithm 5.6

Our next non-smooth method seeks to overcome the theoretical deficiencies of Algorithms 5.5 and 5.6. In some sense, it takes a step towards bridging the gap between this active-set-based solver and the approximate projected subgradient method in the previous subsection.

5.3 The Bundle-Free Implicit Programming Method

We now present the bundle-free implicit programming approach from [46], which we extend to include control constraints. In contrast to the active-set method in the previous subsection, this method is based on B-stationarity conditions. We must therefore assume that S is directionally differentiable.

We first state the basic assumptions and the algorithm. Afterwards, we discuss the motivations for the steps and examine the convergence properties. Throughout this subsection, let \(Z_{\mathrm {ad}} := \left \{z \in L^2(\varOmega ) \left |\; a \le z \le b \right .\right \}\) with a, b ∈ L2(Ω) and a < b and assume that the variational inequality is defined as in (7). We otherwise work in the abstract framework of (6), where \( \mathcal {J}(z) := F(S(B z)) + G(z) \) and F and G are continuously Fréchet differentiable. By assumption, ξ ∈ H.

For some constant cq > 0, let q(⋅) := cq∥⋅∥2∕2 and for z ∈ Zad define the local quadratic models:

$$\displaystyle \begin{aligned} \mathcal{M}(\cdot) &:= q(\cdot) + F'(z;S'(Bz; B\cdot)) + G'(z; \cdot),\\ \mathcal{M}_{\epsilon}(\cdot) &:= q(\cdot) + F'(z;d_{\epsilon}(B\cdot)) + G'(z;\cdot) ,\,\,\epsilon > 0, \end{aligned} $$

where d𝜖(⋅) is a smooth approximation of S′(Bz;⋅) such that d𝜖(0) = 0. Here, we suggest letting d𝜖(w) = d, the solution to

$$\displaystyle \begin{aligned} A d + \epsilon^{-1} \chi_{\varOmega^{0+}} d - \chi_{\varOmega^{0}}\nabla^2 \phi^{\epsilon}(-d) = w, \end{aligned}$$

where w ∈ V, 𝜖 > 0, and u = S(Bz). This is related to the true formula for d = S′(Bz;w) given by

$$\displaystyle \begin{aligned} \text{Find } d \in \mathcal{K}(u,\xi): \int_{\varOmega} \nabla d\cdot \nabla [v-d] \mathrm{d}x \ge \langle w, v - d \rangle,\quad \text{ for all } v \in \mathcal{K}(u,\xi), \end{aligned} $$
(43)

where \( \mathcal {K}(u,\xi ) := \left \{ d \in H^1_0(\varOmega ) \left |\; d(x) \ge 0,\; \text{ q.e. } x \in \varOmega ^{0},\; d(x) = 0 \text{ a.e. } x \in \varOmega ^{0+}\right .\right \};\) see [63] (or [15, Chap. 6] for an English summary). As this formula makes use of quasi-continuity and potentially non-negligible sets of Lebesgue measure zero, it is unclear how to make direct use of the nonnegativity condition in a numerical method. To circumvent this issue, a certain regularity condition is assumed throughout (as in [46]):

  1. \(\varOmega ^{0} = \overline {\text{int}(\varOmega ^{0})}\),

  2. If m(Ω00) = 0, then cap(Ω00) = 0 or S′(Bu;Bh) = d = 0 q.e. x ∈ Ω0.

  3. If \(m(\varOmega ^{00}) > 0\), then \(\exists \nu > 0, \exists \gamma ' > 0: \forall \gamma \ge \gamma ',\; d_{\gamma ,\varepsilon } \ge 0\) a.e. on \(\mathcal {A}_{\nu }\setminus \varOmega ^{0}\), where \(\mathcal {A}_{\nu } := \left \{x \in \varOmega \left | \mathrm{dist}(x,\varOmega ^{0}) < \nu \right .\right \}\).

In light of this assumption, we speak of “negligible” biactive sets whenever m(Ω00) = 0. Note that S is in fact Gâteaux differentiable, whenever this condition holds. This once again highlights a fundamental difference between the optimal control of PDEs and variational inequalities, even for the simplest variational inequality of interest.

We also suggest the smooth approximation of the directional derivative in order to guarantee that \(p_{\epsilon } = d^{\prime }_{\epsilon }(0)^*w\) solves

$$\displaystyle \begin{aligned} A^* p + \epsilon^{-1} \chi_{\varOmega^{0+}} p = w \end{aligned}$$

and that {p𝜖}𝜖>0 is uniformly bounded in V . Notice the similarity to the approximating limiting subgradient suggested earlier.

Algorithm 5.7 A bundle-free implicit programming approach

In Algorithm 5.7, \( \mathcal {T}_{Z_{\mathrm {ad}}}(z)\) is the tangent cone to Zad at z ∈ Zad, which is defined by

$$\displaystyle \begin{aligned} \mathcal{T}_{Z_{\mathrm{ad}}}(z) := -\left[ \mathcal{N}_{Z_{\mathrm{ad}}}(z)\right]^{+} = \overline{\mathbb{R}_{+}(Z_{\mathrm{ad}} - z)}. \end{aligned}$$
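For the box constraints a ≤ z ≤ b, the tangent cone acts pointwise (directions must be nonnegative where z = a and nonpositive where z = b, the characterization used later in this subsection), so its L2 projection is a simple clipping operation. The sketch below implements this; the tolerance used to detect the bound-active sets is our own choice.

```python
import numpy as np

# Pointwise projection onto T_{Z_ad}(z) for the box constraints a <= z <= b:
# directions must be >= 0 where z = a and <= 0 where z = b (the pointwise
# characterization used later in this subsection).  The tolerance for
# detecting the bound-active sets is our own choice.

def proj_tangent_cone(v, z, a, b, tol=1e-10):
    w = v.copy()
    on_lower = z <= a + tol            # Omega^a: only nonnegative directions
    on_upper = z >= b - tol            # Omega^b: only nonpositive directions
    w[on_lower] = np.maximum(w[on_lower], 0.0)
    w[on_upper] = np.minimum(w[on_upper], 0.0)
    return w

# Example: delta_z = Proj_{T_{Z_ad}(z)}[-c_q^{-1} grad_M] as in Algorithm 5.7.
z = np.array([0.0, 0.3, 0.8]); a = np.zeros(3); b = 0.8 * np.ones(3)
grad_M = np.array([0.5, -0.2, -0.1]); c_q = 1.0
print(proj_tangent_cone(-grad_M / c_q, z, a, b))   # -> [0.  0.2 0. ]
```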

We use the generalized Armijo line search (cf. [13]): Set \(\tau ^k := \beta ^{m_k} s\), where mk is the first nonnegative integer such that

$$\displaystyle \begin{aligned} \mathcal{J}(z^k) - \mathcal{J}(z^k(\beta^{m} s)) \ge \frac{\sigma}{\beta^{m}s}\|z^k - z^k(\beta^{m} s) \|{}^2_{L^2}, \end{aligned} $$
(46)

where β, σ ∈ (0, 1), s > 0 and, given a step δz, we set

$$\displaystyle \begin{aligned} z(\tau) := \mathrm{Proj}_{Z_{\mathrm{ad}}}(z +\tau \delta z),\; \tau > 0. \end{aligned}$$
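A backtracking realization of the rule (46) might look as follows; the Euclidean norm stands in for the (discretized) L2 norm, and the default parameter values are illustrative. Returning a zero step when the loop exhausts its budget is our way of signalling that a robustification step is needed.

```python
import numpy as np

# Backtracking realization of the generalized Armijo rule (46).  J is the
# reduced objective, proj realizes Proj_{Z_ad}, and the Euclidean norm stands
# in for the (discretized) L^2 norm; parameter defaults are illustrative.

def armijo_step(J, proj, z, delta_z, s=1.0, beta=0.5, sigma=0.1, max_iter=40):
    Jz = J(z)
    tau = s                                          # tau = beta^m * s, m = 0, 1, ...
    for _ in range(max_iter):
        z_new = proj(z + tau * delta_z)              # z^k(tau) = Proj_{Z_ad}(z^k + tau dz)
        # Accept if J(z^k) - J(z^k(tau)) >= (sigma / tau) * ||z^k - z^k(tau)||^2.
        if Jz - J(z_new) >= (sigma / tau) * np.linalg.norm(z - z_new) ** 2:
            return z_new, tau
        tau *= beta
    return z, 0.0                                    # no acceptable step: robustify
```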

Since much of Algorithm 5.7 is derived from B-stationarity conditions, we restate them here:

$$\displaystyle \begin{aligned} \mathcal{J}'(z;w -z) = F'(S(Bz))S'(Bz;B(w-z)) + G'(z)(w-z)\ge 0,\quad \forall w \in Z_{\mathrm{ad}}. \end{aligned} $$
(47)

From (47), it is clear that we can equivalently reformulate B-stationarity as:

$$\displaystyle \begin{aligned} \mathcal{J}'(z;\delta z) \ge 0,\,\,\forall \delta z \in \mathcal{T}_{Z_{\mathrm{ad}}}(z). \end{aligned}$$

Moreover, since \(\mathcal {J}'(z;\delta z)\) is positively homogeneous in δz, it follows that δz = 0 is a minimizer of \(\mathcal {J}'(z;\delta z)\) over \(\mathcal {T}_{Z_{\mathrm {ad}}}(z)\), whenever z is B-stationary. Finally, given that \(\mathcal {T}_{Z_{\mathrm {ad}}}(z)\) is a non-empty, closed, and convex cone, we can add the coercive quadratic form q to the definition of B-stationarity without altering the characterization, cf. the general analysis in [46, Section 2]. Therefore, if z is B-stationary, then 0 ∈ Z solves the auxiliary problem

$$\displaystyle \begin{aligned} \min\left\{ \mathcal{M}(\delta z) := q(\delta z) + \mathcal{J}'(z;\delta z) \text{ over } \delta z \in \mathcal{T}_{Z_{\mathrm{ad}}}(z)\right\}. \end{aligned} $$
(48)

Now, if the biactive set is negligible, then S is in fact Gâteaux differentiable. In this case, (48) has a unique solution given by

$$\displaystyle \begin{aligned} (\delta z^* + c_{q} ^{-1}(B^*S'(Bz)^*\nabla F'(S(Bz)) + \nabla G(z)), w - \delta z^*) \ge 0,\,\,\forall w \in \mathcal{T}_{Z_{\mathrm{ad}}}(z), \end{aligned}$$

which, noting that \( \nabla \mathcal {M}(0) = B^*S'(Bz)^*\nabla F'(S(Bz)) + \nabla G(z)\), is equivalent to the projection formula \(\delta z^* = \mathrm {Proj}_{\mathcal {T}_{Z_{\mathrm {ad}}}(z)}\left [-c_q^{-1} \nabla \mathcal {M}(0)\right ]\) used in Algorithm 5.7. Furthermore, in this smooth setting, we can prove the following result.

Proposition 1

Let z ∈ Zad, u = S(Bz), ξ = Au − Bz − f, and suppose that \(\mathcal {J}\) is Gâteaux differentiable at z. If z is not B-stationary, then (46) stops in a finite number of steps.

Proof

Suppose δz is given by the projection formula in Algorithm 5.7, i.e., δz is the unique global minimizer of (48). Since z is not B-stationary and \(0 \in \mathcal {T}_{Z_{\mathrm {ad}}}(z)\), we have

$$\displaystyle \begin{aligned} q(\delta z) + \mathcal{J}'(z;\delta z) < 0 \Longrightarrow \mathcal{J}'(z;\delta z) \le -\frac{c_q}{2}\|\delta z\|{}^2_{L^2} \end{aligned} $$
(49)

Moreover, for any τ > 0, we have \(z(\tau ) := \mathrm {Proj}_{Z_{\mathrm {ad}}}(z+\tau \delta z)\) and, since \(\mathrm {Proj}_{Z_{\mathrm {ad}}}\) is non-expansive and \(\mathrm {Proj}_{Z_{\mathrm {ad}}}(z) = z\):

$$\displaystyle \begin{aligned} \|z-z(\tau)\|{}_{L^2} = \|\mathrm{Proj}_{Z_{\mathrm{ad}}}(z+\tau \delta z) - \mathrm{Proj}_{Z_{\mathrm{ad}}}(z) \|{}_{L^2} \le \tau \|\delta z\|{}_{L^2}. \end{aligned} $$
(50)

Furthermore, since z ∈ Zad, \(\delta z \in \mathcal {T}_{Z_{\mathrm {ad}}}(z)\), and Zad is defined by simple pointwise bound constraints in L2(Ω), we appeal to the proof of Lemma 6.34 [15], which shows that

$$\displaystyle \begin{aligned} \tau^{-1}(z(\tau)-z) \to \delta z \text{ as } \tau \downarrow 0. \end{aligned} $$
(51)

Finally, suppose that (46) fails for all τ > 0. Then,

$$\displaystyle \begin{aligned} \tau^{-1}(\mathcal{J}(z(\tau)) - \mathcal{J}(z))> -\sigma\tau^{-2}\|z-z(\tau)\|{}^2_{L^2},\quad \forall \tau > 0. \end{aligned}$$

On the right side of the inequality, we can estimate from below using (50):

$$\displaystyle \begin{aligned} -\sigma\tau^{-2}\|z-z(\tau)\|{}^2_{L^2} \ge -\sigma \|\delta z\|{}^2_{L^2} \end{aligned}$$

Now, letting z(τ) = z + ττ−1(z(τ) − z) = z + τdτ, where dτ → δz by (51), we have by (49):

$$\displaystyle \begin{aligned} \tau^{-1}(\mathcal{J}(z(\tau)) - \mathcal{J}(z)) = \tau^{-1}(\mathcal{J}(z + \tau d_{\tau}) - \mathcal{J}(z)) \to \mathcal{J}'(z; \delta z) \le -\frac{c_q}{2}\|\delta z\|{}^2_{L^2}. \end{aligned}$$

But, then

$$\displaystyle \begin{aligned} -\frac{c_q}{2}\|\delta z\|{}^2_{L^2} \ge \mathcal{J}'(z; \delta z) \ge -\sigma \|\delta z\|{}^2_{L^2}, \end{aligned}$$

a contradiction, since σ ∈ (0, cq∕2).

Proposition 1 provides a justification for steps 4:–6: in Algorithm 5.7. Note that the calculation of the gradient \(\nabla \mathcal {M}^k(0)\) requires the solution of an adjoint equation. For example, in the setting of Example 2, \(\nabla \mathcal {M}^k(0) = \alpha z^k - B^*p^k\), where pk solves (here in strong form):

$$\displaystyle \begin{aligned} \begin{array}{rclr} A^* p &=& u_d - u^{k} &\text{ on } \varOmega^+_k,\\ p &=& 0, &\text{ on } \varOmega^0_k, \end{array} \end{aligned} $$
(52)

Turning now to the case when the biactive set is non-negligible, we can easily adjust the proof of Proposition 1 to the non-smooth setting.

Corollary 2

Let z ∈ Zad, u = S(Bz), ξ = Au − Bz − f, and suppose that δz is a minimizer of (48). If z is not B-stationary, then (46) stops in a finite number of steps.

Nevertheless, whenever the biactive set is non-negligible, the directional derivative is nonlinear in δz. In particular, (48) is non-convex, whereas convexity was a key assumption in the arguments above. We therefore need an alternative procedure to calculate the step δz.

In Algorithm 5.7, we suggested the step \( \delta z = \mathrm {Proj}_{\mathcal {T}_{Z_{\mathrm {ad}}}(z^k)}\left [-c_q^{-1} \nabla \mathcal {M}^k_{\epsilon ^k}(0)\right ], \) which is related to the smoothed auxiliary problem using the 𝜖-dependent model \(\mathcal {M}_{\epsilon }\):

$$\displaystyle \begin{aligned} \min\left\{\mathcal{M}_{\epsilon}(\delta z) := q(\delta z) + F'(z;d_{\epsilon}(B\delta z)) + G'(z)\delta z \text{ over } \delta z \in \mathcal{T}_{Z_{\mathrm{ad}}}(z)\right\}. \end{aligned} $$
(53)

Note that if the approximation is chosen so that d𝜖 → S′(Bz;⋅) and 0 solves (53) for all sufficiently small 𝜖 > 0 (or at least along some null sequence), then z must be B-stationary, as 0 solves (48) as well. In fact, in the current setting, A is symmetric, so we can use the approximation results in [7] (discussed above), see also [46], to argue that if δz𝜖 → δz weakly in L2(Ω), then

$$\displaystyle \begin{aligned} d_{\epsilon}(B \delta z_{\epsilon}) \to S'(Bz; B\delta z), \text{ as } \epsilon \downarrow 0, \end{aligned}$$

provided that B is completely continuous, e.g., when B is the embedding of L2(Ω) into H−1(Ω).

On the other hand, if there exists some \(\delta z \in \mathcal {T}_{Z_{\mathrm {ad}}}(z)\) such that \( ( \nabla \mathcal {M}_{\epsilon }(0), \delta z ) < 0, \) then by continuity, there is some η𝜖 > 0 such that \(\mathcal {M}_{\epsilon }(\eta _{\epsilon } \delta z) < 0\). Since \(\mathcal {T}_{Z_{\mathrm {ad}}}(z)\) is a cone, \(\eta _{\epsilon } \delta z \in \mathcal {T}_{Z_{\mathrm {ad}}}(z)\) and 0 does not solve (53). If this persists as 𝜖 ↓ 0, then z cannot be B-stationary.

In light of this, consider our δz update. Let \(w := d^{\prime }_{\epsilon }(0)^*F'(z) + G'(z) = \nabla \mathcal {M}_{\epsilon }(0)\) and note that

$$\displaystyle \begin{aligned} \mathcal{T}_{Z_{\mathrm{ad}}}(z) = \left\{\delta z \in L^2(\varOmega) \left|\; \delta z \ge 0, \text{ a.e. on } \varOmega^{a},\; \delta z \le 0, \text{ a.e. on } \varOmega^{b}\right.\right\}, \end{aligned}$$

where \(\varOmega ^{a} := \left \{x \in \varOmega \left |\; z(x) = a(x) \right .\right \}\), \(\varOmega ^{b} := \left \{x \in \varOmega \left |\; z(x) = b(x) \right .\right \}\) and Ωina := Ω ∖ (Ωa ∪ Ωb). Then, using the basic properties of \(\mathrm {Proj}_{\mathcal {T}_{Z_{\mathrm {ad}}}(z)}\), we have

$$\displaystyle \begin{aligned}\displaystyle (\nabla \mathcal{M}_{\epsilon}(0),\delta z) =\\\displaystyle -c_{q}\left[\int_{\varOmega^{ina}} |\delta z|{}^2 + \int_{\varOmega^{a} \cap \{ -c_q^{-1} w \ge 0\}} |\delta z|{}^2 + \int_{\varOmega^{b}\cap \{-c_q^{-1} w \le 0\}} |\delta z|{}^2\right] \le 0. \end{aligned} $$

Assuming that the latter term is nonzero and expanding \(\mathcal {M}_{\epsilon }\) at zero in direction δz, we obtain:

$$\displaystyle \begin{aligned} \mathcal{M}_{\epsilon}(0 + \eta \delta z) = \eta\left( \frac{\eta c_{q}}{2}\|\delta z\|{}^2_{L^2} + (\nabla \mathcal{M}_{\epsilon}(0),\delta z) + o(1) \right). \end{aligned} $$
(54)

We may then choose a sufficiently small η𝜖 > 0 such that

$$\displaystyle \begin{aligned} \frac{\eta_{\epsilon} c_{q}}{2}\|\delta z\|{}^2_{L^2} + (\nabla \mathcal{M}_{\epsilon}(0),\delta z) + o(1) < 0. \end{aligned}$$

Hence, \(\mathcal {M}_{\epsilon }(\eta _{\epsilon } \delta z) \le 0\) and by definition

$$\displaystyle \begin{aligned} F'(z;d_{\epsilon}(B(\eta_{\epsilon} \delta z))) + G'(z)(\eta_{\epsilon} \delta z) \le - \eta^2_{\epsilon} q(\delta z_{\epsilon}). \end{aligned}$$

Therefore, if a uniform lower bound η > 0 with η𝜖 ≥ η > 0 exists, then we can prove that δz𝜖 is a descent direction for some sufficiently small 𝜖 > 0.

Proposition 2

Let z ∈ Zad, u = S(Bz), and ξ = Au − Bz − f, and suppose that \(\mathcal {J}\) is only directionally differentiable at z. Furthermore, let \( \delta z_{\epsilon } = \mathrm {Proj}_{\mathcal {T}_{Z_{\mathrm {ad}}}(z)}\left [-c_q^{-1} \nabla \mathcal {M}_{\epsilon }(0)\right ] \) and assume that \( \limsup _{\epsilon \downarrow 0} ( \nabla \mathcal {M}_{\epsilon }(0),\delta z_{\epsilon }) < 0. \) If there exists an η > 0 such that \(\mathcal {M}_{\epsilon }(\eta \delta z_{\epsilon }) \le 0\) for all sufficiently small 𝜖 > 0, then there exists some \(\widehat {\epsilon } > 0\) such that \(\delta z_{\widehat {\epsilon }}\) is a descent direction for \(\mathcal {J}\) at z.

Proof

By assumption, F′(z;d𝜖(B(ηδz𝜖))) + G′(z)(ηδz𝜖) ≤ −η2q(δz𝜖). Then,

$$\displaystyle \begin{aligned}\displaystyle \mathcal{J}'(z;\eta \delta z_{\epsilon}) = (\mathcal{J}'(z; \eta \delta z_{\epsilon}) - F'(z;d_{\epsilon}(B(\eta \delta z_{\epsilon}))) - G'(z)(\eta \delta z_{\epsilon})) +\\ {} \displaystyle F'(z;d_{\epsilon}(B(\eta \delta z_{\epsilon}))) + G'(z)(\eta \delta z_{\epsilon} ) \le \\ {} (\mathcal{J}'(z; \eta \delta z_{\epsilon}) - F'(z;d_{\epsilon}(B(\eta \delta z_{\epsilon}))) - G'(z)(\eta \delta z_{\epsilon} )) - \eta^2 q(\delta z_{\epsilon}) =\\ {} F'(z;S'(Bz;B(\eta \delta z_{\epsilon}))) - F'(z;d_{\epsilon}(B (\eta \delta z_{\epsilon}))) - \eta^2 q(\delta z_{\epsilon}) =\\ {} \langle \nabla F(z), S'(Bz;B(\eta \delta z_{\epsilon})) - d_{\epsilon}(B (\eta \delta z_{\epsilon}))\rangle - \eta^2 q(\delta z_{\epsilon}). \end{aligned} $$

Now, since \(\left \{p_{\epsilon }\right \}\) is bounded, we can show that \(\left \{\delta z_{\epsilon }\right \}\) is bounded. Hence, there is a subsequence (denoted still by 𝜖) such that \(\delta z_{\epsilon } \rightharpoonup \delta z^*\). Then, for sufficiently small \(\widehat {\epsilon } > 0\)

$$\displaystyle \begin{aligned} \langle \nabla F(z), S'(Bz;B(\eta \delta z_{\widehat{\epsilon}})) - d_{\widehat{\epsilon}}(B (\eta \delta z_{\widehat{\epsilon}}))\rangle \le \eta^2 q(\delta z_{\widehat{\epsilon}})/2, \end{aligned} $$
(55)

Hence, \( \mathcal {J}'(z; \delta z_{\widehat {\epsilon }}) \le -c_{q}\eta \|\delta z_{\widehat {\epsilon }}\|{ }^2/4, \) as was to be shown.

Since a direct verification of the hypotheses is potentially too expensive from a computational standpoint, we suggest a heuristic. Fix a lower bound \( \underline {\eta } > 0\). If \((\nabla \mathcal {M}_{\epsilon }(0),\delta z) < -q(\delta z)\) and (55) (or an approximation as in [46, Remark 3.13]) holds with η = 1 (or \(\eta = \underline {\eta }\)), we use the current δz in the line search. Thus, the “descent condition” in 9: holds. If \(-q(\delta z) \le (\nabla \mathcal {M}_{\epsilon }(0),\delta z) < 0\), then we choose η > 0 such that \(\eta q(\delta z)+ (\nabla \mathcal {M}_{\epsilon }(0),\delta z) < 0\), set δz := ηδz, update \( \underline {\eta } := \min (\eta , \underline {\eta })\), and check (55) or an approximation. Here, the descent condition also holds, but the model \(\mathcal {M}_{\epsilon }\) might be failing. If (55) fails in the latter, then we go to step 10:. Finally, if \((\nabla \mathcal {M}_{\epsilon }(0),\delta z) = 0\), then we go to step 10: . In practice, one might also attempt to circumvent the verification of (55) by using a sufficiently small 𝜖 > 0 and decreasing at every step.
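The decision logic of this heuristic can be summarized in a few lines. In the sketch below, check_55 stands for a user-supplied test of (55) (or an approximation of it); how that test is parameterized, and how the corner cases are resolved, follows our own reading of the description above.

```python
import numpy as np

# Compact encoding of the step-acceptance heuristic described above.  Here
# q(dz) = (c_q/2)||dz||^2, and check_55(dz, eta) stands for a user-supplied
# test of (55) (or an approximation); its exact parameterization and the
# treatment of corner cases follow our own reading of the text.

def choose_step(grad_M0, dz, c_q, eta_lb, check_55, ip=np.dot):
    slope = ip(grad_M0, dz)                     # (grad M_eps(0), delta z)
    q_dz = 0.5 * c_q * ip(dz, dz)
    if slope < -q_dz and check_55(dz, 1.0):
        return dz, eta_lb, "line search"        # model step accepted as is
    if -q_dz <= slope < 0.0:
        eta = 0.5 * (-slope) / q_dz             # guarantees eta*q(dz) + slope < 0
        eta_lb = min(eta, eta_lb)
        if check_55(eta * dz, eta):
            return eta * dz, eta_lb, "line search"   # scaled step still usable
        return None, eta_lb, "step 10"          # (55) fails: fall back
    return None, eta_lb, "step 10"              # no descent predicted by the model
```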

This heuristic has its limitations, e.g., it cannot prohibit \( \underline {\eta } \downarrow 0\). Therefore, if it appears that this is the case, we break the while loop and go to a “robustification step” in 15:. This just means that if the quadratic model \(\mathcal {M}_{\epsilon }\) appears to be ineffective, then we should calculate a new control z by using an alternative method that is guaranteed to converge and “restart” the algorithm. For instance, we could use the adaptive penalty method in Section 4.1 (for a fixed γ, increasing each time the robustification step is used).

Finally, we recall that in the classical projected-gradient approaches, the Lipschitz continuity of the gradient of the objective is essential for convergence proofs, cf. [13]. There, one can show that the line search will always stop provided that the step size is below a certain threshold which only depends on σ and the Lipschitz modulus L. This is, of course, not possible for the control of variational inequalities as the reduced objective is non-smooth (even when the biactive set is negligible we only have Gâteaux differentiability). Therefore, we must also monitor the behavior of the accepted step sizes at each iteration. If, as with \( \underline {\eta }\), the step sizes τk appear to be rapidly decreasing with each iteration, then we also make use of a robustification step. For a full convergence proof in the case of the canonical MPEC (excluding control constraints), we refer the interested reader to [46].

We conclude by demonstrating the performance of the bundle-free method on two examples, Example 3 and the following example (adapted from [39, Ex. 6.1] by adding control constraints).

Example 4

Here, we let a ≡ 0, b ≡ 0.8 and set α = 1. In addition, we define f and ud as follows:

$$\displaystyle \begin{aligned} z_1({\mathbf{x}}_1) &= -4096{\mathbf{x}}_1^6 + 6144{\mathbf{x}}_1^5 -3072{\mathbf{x}}_1^4 + 512{\mathbf{x}}_1^3,\\ {} z_2({\mathbf{x}}_2) &= -244.140625{\mathbf{x}}_2^6+585.9375{\mathbf{x}}_2^5 -468.75{\mathbf{x}}_2^4+125{\mathbf{x}}_2^3,\\ {} y^*({\mathbf{x}}_1,{\mathbf{x}}_2) &= \left\{\begin{array}{ll} z_1({\mathbf{x}}_1)z_2({\mathbf{x}}_2) & ({\mathbf{x}}_1,{\mathbf{x}}_2) \in ]0,0.5[\times ]0,0.8[,\\ {} 0 & \text{otherwise,} \end{array}\right.\\ u^* &{=} y^*,\quad \xi^*({\mathbf{x}}_1,{\mathbf{x}}_2) {=} 2\max(0,-|{\mathbf{x}}_1 {-} 0.8|{-}|{\mathbf{x}}_1{\mathbf{x}}_2 {-} 0.2|{-}0.3+0.35),\\ f &= -\varDelta y^* - u^* - \xi^*,\quad u_d = y^*+\xi^* -\alpha\varDelta u^*. \end{aligned} $$
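For completeness, the data of Example 4 can be assembled on a uniform grid as follows; the discretization (interior nodes, 5-point Laplacian with homogeneous Dirichlet padding) is our own choice and only meant as a sketch.

```python
import numpy as np

# Assembling the data of Example 4 on a uniform grid.  The discretization
# (interior nodes, 5-point Laplacian with homogeneous Dirichlet padding) is
# our own choice; only the formulas themselves come from the example.

N = 512
h = 1.0 / (N + 1)
x1, x2 = np.meshgrid(np.linspace(h, 1 - h, N), np.linspace(h, 1 - h, N),
                     indexing="ij")

z1 = -4096 * x1**6 + 6144 * x1**5 - 3072 * x1**4 + 512 * x1**3
z2 = -244.140625 * x2**6 + 585.9375 * x2**5 - 468.75 * x2**4 + 125 * x2**3
y_star = np.where((x1 < 0.5) & (x2 < 0.8), z1 * z2, 0.0)
u_star = y_star
xi_star = 2 * np.maximum(0.0, -np.abs(x1 - 0.8) - np.abs(x1 * x2 - 0.2) - 0.3 + 0.35)

def laplacian(v):
    # 5-point stencil; values outside the grid are taken to be zero.
    w = np.pad(v, 1)
    return (w[:-2, 1:-1] + w[2:, 1:-1] + w[1:-1, :-2] + w[1:-1, 2:] - 4 * v) / h**2

alpha = 1.0
f = -laplacian(y_star) - u_star - xi_star
u_d = y_star + xi_star - alpha * laplacian(u_star)
```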

In order to compare to the other methods, we again use a uniform grid with 512 × 512 grid points and start the algorithm at (z0, u0, ξ0) = (0, 0, 0) for Example 3. In contrast, we use a random starting point when solving Example 4. We start the algorithm with 𝜖0 = 10−10 and subsequently set 𝜖k+1 = 𝜖k∕2.

For Example 3 we obtain the same solution as in all the previous algorithms. Likewise, the algorithm performs very well on Example 4, see Table 5 and Figure 6. We note, however, that Example 4 (when starting with a random initial guess) is more difficult to solve. In fact, once the difference in function values becomes negligible, i.e., on the order of \(\mathcal {O}(10^{-13})\), the residual of S-stationarity stagnates at \(\mathcal {O}(10^{-7})\). Finally, one important aspect of the theory for the bundle-free method was the relation \((\nabla \mathcal {M}_{\epsilon }(0),\delta z) < -q(\delta z)\). To see how the choice of 𝜖k influences this, see Table 6. There, we observe that far from the solution, a larger value of 𝜖k seems to yield a good approximation. However, the choice becomes critical once we close in on the solution.

Fig. 6

Clockwise from upper left: characteristic functions \(\chi _{\varOmega ^{b}_{\star }}\) (red) and \(\chi _{\varOmega ^{a}_{\star }} \) (blue) (set of all indices with \(z^{\star }_i \ge b_i-\)1e−6 and \(z^{\star }_i \le a_i+\)1e−6), optimal control z, state u, multiplier λ, adjoint p, and multiplier ξ for Example 4

Table 5 Residuals, step sizes, and number of PDAS iterations (“iter”) for Algorithm 5.7 when used to solve Examples 3 and 4
Table 6 Behavior of Taylor expansion of \(\mathcal {M}_{\epsilon }(0+\delta z_k)\) at each iteration in Algorithm 5.7 when applied to Example 3

6 Conclusion

Despite being a topic of interest for several decades, the optimal control of variational inequalities continues to be an active field of research. The rapidly growing interest in the past decade appears to be a result of the many theoretical, algorithmic, and computational advances in PDE-constrained optimization to date. We have seen here that both reduced space and full-space approaches are possible; however, the difficulties due to either a non-smooth control-to-state mapping or degenerate complementarity constraints persist. The various techniques for deriving optimality conditions in the presence of non-smoothness or degeneracy have led both in finite and infinite dimensions to a hierarchy of first-order optimality conditions. Some of these conditions are directly related to function-space-based numerical methods (C-stationarity) whereas other conditions (e.g., those of Mignot and Puel) are derived using concepts of generalized differentiation and a fine analysis of the regularity properties of the underlying functions and multipliers. These facts should therefore always be taken into account when developing numerical optimization algorithms.

In our numerical studies, we considered two main types of solution methods: smooth and non-smooth. Using approximation techniques for variational inequalities as in [30, 31], the smooth methods are almost always available and allow us to immediately take advantage of existing solvers for (smooth) PDE-constrained optimization problems. For the smoothed/regularized problems, we are only limited by our knowledge of the corresponding parameter-dependent PDE-constrained optimization problem. In our study, we make use of a smooth continuation approach and solve the first-order conditions directly using a semismooth Newton method. For small penalty parameters 𝜖, the solution of the linear system (24) needed to calculate the updates is relatively well-behaved. However, the lower off-diagonal block becomes increasingly problematic as we attempt to approach the limiting problem for 𝜖 ↓ 0. Thus, we eventually pay a major price for smoothing the original problem.

As mentioned in the introduction, we present here several possible non-smooth methods that mirror related approaches in smooth PDE-constrained optimization. The first method is inspired by the classical projected subgradient methods in [54, 76]. Just as in these classical methods, the relative cost of each iteration is roughly as cheap as a standard projected-gradient approach sans line search: solve the VI, then a linear elliptic PDE, and compute a pointwise projection. As with all subgradient methods, the strong convergence statements are essentially limited to convex problems. Nevertheless, the method presented here behaves quite well in practice and thus warrants a deeper study in future research.

The active-set method has been taken from [36] and adapted to the setting of the canonical example. At every iteration, the cost of solving the nonlinear system is slightly more than in the smooth continuation approach due to the feasibility restoration step, which requires the solution of a mixed complementarity problem. However, the conditioning issues are now absent as the potentially problematic perturbation in the lower off-diagonal blocks has been removed. Just as in [36], this method performs exceptionally well, even on a problem with persistent biactivity throughout the iterations. However, as mentioned in the text, proving convergence of this method is rather difficult. One possibility would be to make regularity and monotonicity arguments on the data and active sets throughout the iterations (as was done in [42] for a related problem).

Finally, we considered the bundle-free implicit programming approach, which can be thought of as a globalization of the active-set strategy since we typically choose a step size of τ = 1 at the beginning of every line search. Though the theory does contain several strong assumptions to guarantee unconditional global convergence to a stationary point, there appear to be no other genuinely non-smooth function-space-based algorithms for non-smooth non-convex problems currently in the literature. As noted in [46], if one can prove a kind of semismooth property of the reduced objective, then the convergence theory is greatly simplified. In comparison to the other non-smooth methods, it is clearly more costly due to the usage of the line search. The caveats for this method are the need to monitor the step sizes (essentially restarting if they get too small) and a meaningful heuristic for the convergence criteria. Nevertheless, the theory stands in contrast to non-convex bundle methods as outlined in, e.g., [75]. There, in the ideal non-convex setting, the algorithm terminates at a point at which 0 lies up to some tolerance ε > 0 in a convex hull of certain subgradients corresponding to points yi that are not “far away” from the current iterate xk. The connection to the various MPEC stationarity concepts is therefore unclear.